Using word context in documents to predict brain representation of meaning.

In an article titled Predicting Human Brain Activity Associated with the Meanings of Nouns, Tom Mitchell of Carnegie Mellon University and colleagues describe how they showed that the meanings of words map onto brain activity in predictable ways.
The first step was to get examples of words in context; for this they obtained a trillion-word collection of documents.

It has been found that a word's meaning is captured to some extent by the distribution of words and phrases with which it commonly co-occurs. For instance, the word 'soup' will often occur in documents that also contain the words 'spoon', 'eat', and 'dinner'. It would be far less likely to co-occur with words like 'quasar' or 'supernova'. They then used the statistics they had gathered as follows:
Given an arbitrary stimulus word w, they encoded the meaning of w as a vector of intermediate semantic features. For example, one intermediate semantic feature might be the frequency with which w co-occurs with the verb 'hear'. The word 'ear' would presumably co-occur frequently with 'hear' in the trillion-word corpus, while the word 'toe' would have a lower co-occurrence frequency.

They limited themselves to 25 semantic features, so they had to pick and choose among candidate words, and they settled on 25 verbs:
“see,” “hear,” “listen,” “taste,” “smell,” “eat,” “touch,” “rub,” “lift,” “manipulate,” “run,” “push,” “fill,” “move,” “ride,” “say,” “fear,” “open,” “approach,” “near,” “enter,” “drive,” “wear,” “break,” and “clean.”
These verbs generally correspond to basic sensory and motor activities, actions performed on objects, and actions involving changes to spatial relationships.

For each verb, the value of the corresponding intermediate semantic feature for a given input stimulus word w (they used nouns for w) is the normalized co-occurrence count. For example, if the word 'ear' occurs near 'hear' 1000 times in the trillion-word corpus, but occurs 3000 times overall, then the normalized value would be 1/3. On top of that, the vector of features was normalized to unit length, which means that if you plotted the vector in the 25 dimensions, the length of the vector would be 1.
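The two normalization steps can be sketched as follows. The co-occurrence counts here are made up for illustration; only the 25-verb list comes from the paper.

```python
import numpy as np

# Hypothetical counts: how often 'ear' appears near each feature verb.
# Verbs not listed are assumed to have a zero count for brevity.
cooccurrence = {"hear": 1000, "listen": 400, "touch": 100}
total_occurrences = 3000  # times 'ear' appears in the corpus overall

verbs = ["see", "hear", "listen", "taste", "smell", "eat", "touch", "rub",
         "lift", "manipulate", "run", "push", "fill", "move", "ride", "say",
         "fear", "open", "approach", "near", "enter", "drive", "wear",
         "break", "clean"]

# Step 1: normalize each co-occurrence count by the word's total count,
# so 'hear' gets 1000/3000 = 1/3.
features = np.array([cooccurrence.get(v, 0) / total_occurrences for v in verbs])

# Step 2: scale the whole 25-dimensional vector to unit length.
features = features / np.linalg.norm(features)
# The vector now ends at distance 1 from the origin.
```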

The next step was to take functional MRI (fMRI) scans of the brains of volunteers as they were exposed to various nouns. For instance, the experimenters would obtain an fMRI of the response to the word 'ear', which we'll say for this example has a value of 1/3 for the semantic feature 'hear'. They divided the fMRI picture into a large number of tiny packed cubes called 'voxels' (analogous to small square pixels in a 2-dimensional picture) and used regression to fit a model that in this case associated the value (1/3) of the semantic feature 'hear' with each voxel's activation. So if they had N voxels, the regression would find N weights associating 'hear' with the voxels. They did this for all 25 features in the feature vector.

Once the model had learned correlations between the voxels and the feature vectors for a training set of words, they tried to predict what an fMRI would look like for new nouns – words the model had not been trained on. They found that the predicted fMRI neural activity matched the observed activity well enough that the model could successfully identify words it had not yet encountered, with accuracies far above those expected by chance.
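The training and matching procedure can be sketched with synthetic numbers (the real study used measured scans; the dimensions and noise level below are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: training nouns, each with a 25-dim semantic feature
# vector (rows of X) and an fMRI image flattened to n_voxels values (rows of Y).
n_train, n_feat, n_voxels = 58, 25, 500
X = rng.random((n_train, n_feat))
W = rng.normal(size=(n_feat, n_voxels))            # "true" feature-to-voxel weights
Y = X @ W + 0.1 * rng.normal(size=(n_train, n_voxels))

# Fit one linear model per voxel, all at once, via least squares.
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Held-out test: predict images for two unseen nouns and match the
# observed image to the closer prediction, using cosine similarity.
x_a, x_b = rng.random(n_feat), rng.random(n_feat)
obs_a = x_a @ W                      # pretend this is the measured scan for noun A
pred_a, pred_b = x_a @ W_hat, x_b @ W_hat

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# The observed image for noun A should resemble prediction A more
# closely than prediction B.
match_correct = cosine(obs_a, pred_a) > cosine(obs_a, pred_b)
```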

In a prediction, every voxel's activity is a weighted sum of the 25 feature values of the presented word. Recall that each feature is obtained from co-occurrence statistics.

They caution that the assumption that brain activation is a weighted linear sum of contributions from each of its semantic features is debatable (the relationship might not be linear), but they still got good results using that assumption. They also note that they correlate features with the entire fMRI image, and then observe which areas especially light up for a given meaning.

What would the vector for the word 'hear' look like? One of the features in that vector is 'hear' itself, so its value would be one, and all the other values in the vector would be zero.

You can present this vector to the model and see where it predicts the fMRI would show activation.
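In the linear model, presenting such a basis vector simply reads out one row of the learned weight matrix – that row is the feature's 'signature'. A minimal sketch, with toy weight values:

```python
import numpy as np

rng = np.random.default_rng(1)
n_feat, n_voxels = 25, 500
W_hat = rng.normal(size=(n_feat, n_voxels))  # learned weights (toy values)

# Basis vector for 'hear': 1 for that feature, 0 everywhere else.
# 'hear' is the second verb in the 25-verb list, so index 1.
e_hear = np.zeros(n_feat)
e_hear[1] = 1.0

predicted_image = e_hear @ W_hat
# The prediction is exactly the learned 'hear' signature (row 1 of W_hat).
print(np.allclose(predicted_image, W_hat[1]))  # True
```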

They found that:

the learned fMRI signature for the semantic feature “eat” predicts strong activation in opercular cortex, which others have suggested is a component of gustatory cortex involved in the sense of taste. Also, the learned fMRI signature for “push” predicts substantial activation in the right postcentral gyrus, which is widely assumed to be involved in the planning of complex, coordinated movements. Furthermore, the learned signature for “run” predicts strong activation in the posterior portion of the right superior temporal lobe along the sulcus, which others have suggested is involved in perception of biological motion.
To summarize, these learned signatures cause the model to predict that the neural activity representing a noun will exhibit activity in gustatory cortex to the degree that this noun co-occurs with the verb “eat,” in motor areas to the degree that it co-occurs with “push,” and in cortical regions related to body motion to the degree that it co-occurs with “run.”

They also find that different people have commonalities in how they represent meanings, though it would be interesting to see if there were differences in very abstract words – for instance, does a group of Marxists represent ‘justice’ differently than a group of conservative economists?

They did find some differences even with concrete words:

in some cases the correspondence holds for only a subset of the nine participants. For example, additional features for some verbs for participant P1 include the signature for “touch,” which predicts strong activation in somatosensory cortex (right postcentral gyrus), and the signature for “listen,” which predicts activation in language-processing regions (left posterior superior temporal sulcus and left pars triangularis), though these trends are not common to all nine participants.

Each of the 25 features can be thought of as a dimension, just as X and Y axes are used for 2 dimensions.


On each of the 25 axes is a value between 0 and 1, and together they make a vector that ends at a point at distance 1 from the origin. That point is the location of a word in 'meaning space'. In linear algebra, ideal x and y axes are perpendicular (90 degrees apart and completely independent of each other), but even if they are not, as long as the angle between them is not zero, combinations of numbers on each axis generate a 2D space that contains all the same vectors as if the bases had been independent. In any event, Mitchell's group did test some other choices of features, but the predictions of fMRI scans using those features were not as good as with the 25 verbs above.

The next image shows part of the vectors for 'celery' and for 'airplane'.


A co-occurrence method is not the only way to represent a meaning. A method used by Gustavo Sudre, who worked with Tom Mitchell, was to use context (for instance, the verb plus its object in a verb phrase) and parts of speech.


Another way to represent meaning is like the game '20 questions'. You ask the same questions about a set of words; some will have an answer of 1 (yes) and some an answer of 0 (no). So you can get a vector that captures some aspects of meaning that way as well.
When trying to decode a noun using 218 such questions, it was found that some questions were more informative than others. It was also found that they often fell into three categories: size, manipulability, and animacy (alive or not).
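A tiny sketch of such a question-based vector; the questions and answers below are invented for illustration, not the actual 218 used in the study:

```python
# The same yes/no questions are asked of every word, giving each word a
# binary vector. Questions loosely follow the three categories mentioned
# above: size, manipulability, and animacy.
questions = ["Is it alive?", "Can you hold it in one hand?",
             "Is it bigger than a car?", "Is it man-made?"]

answers = {
    "dog":      [1, 0, 0, 0],
    "spoon":    [0, 1, 0, 1],
    "airplane": [0, 0, 1, 1],
}

# Words with related meanings share more answers.
def similarity(w1, w2):
    return sum(a == b for a, b in zip(answers[w1], answers[w2]))

print(similarity("spoon", "airplane"))  # 2 (both man-made, both not alive)
print(similarity("dog", "airplane"))    # 1 (neither fits in one hand)
```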

This next image is interesting – Gustavo Sudre found that when a noun is represented in the brain, different semantic aspects peak at different times. Different parts of the brain would handle groups of semantic aspects; for instance, a motor area might handle the 'move' aspect of the word 'push'. This raises the question of how all the relevant parts of the word are remembered and used, since they don't peak at the same instant. One idea, at least for sentences, is that different aspects of meaning are part of a cyclical oscillation. To me that's not a sufficient explanation: the different parts of meaning have to interact and create a whole in some way. But here is the image:
Mitchell’s student Leila Wehbe has also done work with entire sentences and fMRI, but that will be a subject for another post.
Sources: (Tom Mitchell video)
