A model of how the neocortex learns, rehearses items in working memory, predicts, and why it has oscillations.

The neocortex is responsible for sensory perception, cognition, generation of motor commands, spatial reasoning, and language. New theories of how it works are being developed. One comes from a company named Numenta, founded by Jeff Hawkins, the entrepreneur who invented the Palm Pilot (the first commercially successful hand-held computing device) and then devoted himself to modeling the brain. To understand his theory you should read the book on Numenta’s website (Biological and Machine Intelligence). This blog post examines a modification of Numenta’s ideas by another entrepreneur, Max Bennett, a cofounder of an AI company called “Bluecore”. The material is taken from his paper “An Attempt at a Unified Theory of the Neocortical Microcircuit in Sensory Cortex”. I should say that a blog post allows for easier reading, because I leave out his citations and skip the neuroscience findings that argue for or against his model. One risk of doing that is that I blur what is known with what is conjectured. Therefore, if the post interests you, you will find more ideas and more accurate information in the article.

Let’s start with the basics. As you can see from the picture, a pyramidal neuron has dendrites that come from its apex and also basal dendrites that come in laterally. The former are called ‘apical’ dendrites and are thought to receive feedback from higher layers. The basal dendrites in this theory are thought to carry contextual information. The two concentric circles that span the image show the area near the cell body (the proximal area) and an area further away (the distal area). Most synapses on a neuron are in the distal region. A single synapse in the proximal area may be able to fire the neuron, but a synapse in the distal area would only weakly depolarize the dendrite, and the signal would be lost by the time it reached the cell body (also known as the soma).

[Image: neuronbasalproximal.png]

We could ask, if a synapse on a distal segment of a dendrite cannot fire a neuron, then what is it for? The answer is that if it fires in combination with other nearby synapses on the same dendrite segment it can create a dendritic NMDA spike. This spike is an action potential but is weaker than the action potential that would be generated at the Soma. The dendritic spike can travel to the soma and if strong enough, perhaps in combination with other NMDA spikes, can set off an axonal action potential that then travels along the axon of the neuron and releases neurotransmitter at outgoing synapses. Even if the neuron is not strongly stimulated enough to fire an axon action potential, a dendrite action potential causes enough of a depolarization to cause learning at the dendritic segment. This allows each dendritic segment to learn which inputs tend to go together, and so the segment becomes a pattern recognizer.
You can think of each segment as taking a logical AND of its inputs because if all the inputs of a pattern fire above a threshold, the segment emits a spike. (This is a simplification because the combination of the inputs is not linear.) The logical OR in the diagram below shows that any one of the dendritic segments can send a spike from the dendrites.
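The AND/OR picture can be made concrete with a minimal sketch in Python (the function names and the threshold value are mine, not from the paper):

```python
# A dendritic segment as a threshold coincidence detector (a soft AND over
# its synapses), and a neuron as an OR over its segments. Function names
# and the threshold are mine, not from the paper.

def segment_spikes(segment_synapses, active_inputs, threshold):
    """A distal segment emits an NMDA spike when enough of its synapses
    see presynaptic activity at the same time."""
    return len(segment_synapses & active_inputs) >= threshold

def neuron_depolarized(segments, active_inputs, threshold):
    """The neuron is depolarized if ANY of its segments spikes -
    a logical OR across pattern recognizers."""
    return any(segment_spikes(seg, active_inputs, threshold) for seg in segments)

# A neuron with two segments, each having learned a different input pattern.
segments = [{1, 2, 3, 4}, {10, 11, 12, 13}]
print(neuron_depolarized(segments, {1, 2, 3, 9}, threshold=3))  # True: 3 of 4 inputs match
print(neuron_depolarized(segments, {5, 6, 7, 8}, threshold=3))  # False: neither pattern matches
```

Note that a partial match above threshold still triggers the segment, which is what makes each segment a noise-tolerant pattern recognizer rather than a strict AND gate.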

The neocortex is a sheet about the size of a napkin, folded over the outside of the cerebrum. It contains billions of neurons, roughly 20% of which are inhibitory, and it is made up of six layers. (Bennett is modeling the “sensory” neocortex, which has different functions than the “frontal” neocortex.)

To quote Bennett:

“Much of the connectivity of the neurons within a given area of the sensory cortex is horizontally contained within a 300 to 600 micron wide column, although spanning vertically across all six layers. This “cortical macrocolumn” of local horizontal connectivity has been proposed to be the canonical neocortical microcircuit. The human neocortex is thought to be made up of over a million such macrocolumns. In order to decipher the computations within a macrocolumn, we must interpret the observed connectivity of each of these types of neurons.”


“L4 and L2/3 of a macrocolumn can also be subdivided vertically into about 80 to 100 minicolumns, each of which is about 50 microns wide. In our model macrocolumn there is one L4 stellate cell per minicolumn and many L2/3 cells within a minicolumn. Cells within L5 and L6 are not mapped to a specific minicolumn, but rather perform computations across the entire macrocolumn”

So the macrocolumn is divided into minicolumns, but the minicolumns do not span the entire macrocolumn; they span only layers L4 and L2/3. Bennett draws a diagram with different layers in different colors. As you can see, L2/3 and L4 are lined up in minicolumns.

So what are the layers for?
Layer 1 (L1) is closest to the skull, and Layers 5 and 6 are the deepest.
L1 is the main target of cortical and subcortical inputs that provide “top-down” information for context-dependent sensory processing. Although L1 is devoid of excitatory cells, it contains the distal dendrites of pyramidal cells (PCs) located in deeper layers.
L4 is where inputs from senses or from lower layers in a cortical hierarchy come in.
In the diagram below, layer 5 is divided into two areas, L5a and L5b. L5a has Regular Spiking neurons (L5a-RS) and L5b has Intrinsic Bursting neurons (L5b-IB).
Layer 6 contains two types of neurons: L6a-CT (Cortico-Thalamic) and L6-CC (Cortico-Cortical). The dendrites of the L6a-CT neurons actually extend to layer 4, which Max Bennett thinks implies that they get the same input as the L4 cells.
There are also inhibitory neurons in the layers, and multiple types of excitatory neurons.

A more detailed picture from Max Bennett’s paper follows:

Numenta’s “Hierarchical Temporal Memory” (HTM) models propose that pyramidal neurons always exist in one of three states: inactive, predictive, and active. In an inactive state, the neuron is highly polarized. In an active state, a neuron is firing action potentials. In a predictive state, a neuron is subthreshold depolarized. The computational purpose of this predictive state is that if a proximal synapse has a presynaptic action potential, neurons in predictive states will fire before neurons in inactive states. In parts of the neocortex with extensive lateral inhibition, this will lead to neurons that were in a predicted state firing action potentials, but those that were in inactive states not firing at all because they get rapidly inhibited before they have a chance to depolarize.

Suppose the two dark colored minicolumns shown on the left stand for the letter ‘B’. All the cells in the minicolumn, whether in L4 or in L2/3, are firing. Numenta’s idea was that you could have only a subset firing. This subset might have been predicted in a prior step, using lateral connections from other minicolumns. So a sequence “A” followed by “B” might have the representation for “A” sending out lateral connections that in the next step, partially depolarize other neurons, some of which would be in the representation for “B”. The depolarization would not be enough to fire a neuron, but would predispose it to firing quickly, and suppressing the neurons in the column that do not have excitatory connections from the representation of “A”. The picture on the right shows a sparser representation of “B” which signifies that “A” happened prior.

You can create many sparse combinations for “B” preceded by some other letter(s): think of “B” as one or more excited neurons in each of the two minicolumns in this example, so that at least one neuron in each column fires. You could have A->B, D->B, H->Q->B, for example, all representable by the two columns.
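A minimal sketch of this activation rule follows, using Numenta’s convention that an unpredicted minicolumn fires all its cells while a predicted one fires only the depolarized subset (the cell names are made up for illustration):

```python
# Numenta-style minicolumn activation: if any cell was predictive, only the
# predictive cells fire (lateral inhibition silences the rest); otherwise the
# whole minicolumn fires. Cell names are made up for illustration.

def minicolumn_fire(cells, predictive):
    winners = [c for c in cells if c in predictive]
    return winners if winners else list(cells)

cells = ["c0", "c1", "c2", "c3"]
print(minicolumn_fire(cells, predictive={"c2"}))  # ['c2']: sparse, context was predicted
print(minicolumn_fire(cells, predictive=set()))   # all four cells fire: context unknown
```

The sparse outcome is what encodes “B preceded by A”; the all-cells outcome encodes “B with no known context”.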

Max Bennett’s model diverges from Numenta here because he doesn’t think the lateral connections from L2/3 neurons to other L2/3 neurons are responsible for predicting the next neurons that will fire. Instead, he thinks that L2/L3 neurons synapse on L5a-RS neurons, which in turn synapse back on L2/L3 neurons, so predictions are made by this indirect route.

I should say here that Numenta’s model so far doesn’t explain many of the higher functions of the neocortex. But they do show how sequences can be learned and recognized, and how objects can be recognized.

Bennett says this:

“…prior HTM models have not incorporated working memory. Second, it appears evident that different macrocolumns coordinate processing together at precise timescales, otherwise it would be impossible for macrocolumns organized in a hierarchy to integrate information accurately. However, I am unaware of a neural circuit model that explains how such precisely timed coordination occurs. Third, prior HTM models do not explicitly incorporate attention. As such, I seek to present a model that can perform the same computations of prior HTM models, but can also:
(1) perform working memory and connect sequences separated by long time intervals.
(2) coordinate its activity and processing with other macrocolumns and structures on extremely precise time intervals.
(3) can be modulated by attention.”

Much of the communication between macrocolumns in the sensory neocortex is done via the thalamus. Located deep in the brain, the thalamus is classically known for its roles as a sensory relay in visual, auditory, and somatosensory systems. It also has significant roles in motor activity, emotion, memory, arousal, and other sensorimotor association functions. The thalamus plays a big role in Bennett’s model.

From his article:

“The thalamus is primarily made up of excitatory thalamocortical relay neurons. Recent experimental studies have shown that there are three categories of these thalamocortical relay neurons within sensory thalamus: ‘‘Core Neurons,’’ ‘‘Multiareal Matrix Neurons,’’ and ‘‘Local Matrix Neurons’’. Each of these has different connectivity with the neocortex. Core neurons project directly to L4-ST neurons. Local Matrix Neurons project to layer 1 within a single level of the cortical hierarchy. Multiareal matrix neurons project to L5a, L1, and L3 across different levels of the hierarchy. Multiareal matrix neurons are also the only one of the three types of relay neurons that project directly to the striatum and the amygdala.”

Bennett continues:

“The thalamus is organized hierarchically, with ‘‘first order relay nuclei’’ passing information directly from peripheral senses (e.g., sight, sound, touch) to ‘‘first order neocortex,’’ while higher order relay nuclei pass information between different levels of neocortex within the hierarchy. Early on in this hierarchy, thalamic nuclei are separated by modalities, with separate nuclei for vision, audition, and somato-sensation.”

Then he explains how the macrocolumns connect to the thalamus. Recall that inputs to a macrocolumn feed in at layer L4, and they get there via “core relay neurons” in the thalamus. It seems that L6a-CT neurons project back down to those core relay neurons. L5b-IB neurons project from the macrocolumn to a higher-level macrocolumn, and they do this via other core neurons that project to L4 in the higher-level macrocolumn. I’ll show a diagram, but first let’s read Bennett’s description:

“The nature of the connectivity between the thalamus and macrocolumns provides clues as to the computations that are being performed in thalamocortical networks. L5b-IB neurons provide driving input (synapses close to soma) to core relay neurons that project to L4 in other ‘‘higher-order’’ macrocolumns. These higher-level macrocolumns seem to repeat the same pattern of relaying their L5b-IB output through even higher-level thalamic relays to even higher-level macrocolumns.
There is also evidence to suggest that L5b-IB neurons provide driving input to local matrix neurons, which project back to L1 in the originating macrocolumn. In contrast to L5b-IB neurons, L6a-CT neurons provide modulatory input (synapses far away from the soma) back to the relay neurons that projected to L4-ST neurons in a given macrocolumn. These L6a-CT projections are generally thought of as the origin of ‘‘top-down’’ signals. They are not able to drive action potentials in thalamic relay neurons on their own, but they can increase the firing rate of an already activated thalamic relay neuron via these modulatory synapses or put them into a subthreshold predictive state.
Surrounding the thalamus is a thin sheet of inhibitory neurons called the thalamic reticular nucleus (‘‘TRN’’). There are two classes of inhibitory neurons within TRN: PV neurons and SOM neurons. PV neurons inhibit core relay neurons while SOM neurons inhibit matrix neurons. PV neurons receive input from L6a-CT neurons in the neocortex, while SOM neurons do not receive any input from the neocortex.”

Again Bennett:

“Layer 4 stellate (‘‘L4-ST’’) neurons are the receiver of bottom-up input from lower-order cortical areas (primarily passing information up from the thalamus). Similar to previous models, I propose that L4-ST neurons perform coincidence detection on this bottom-up input. L4-ST neurons provide strong driving input to all L2/3 cells within its minicolumn. This means that whenever a specific coincidence of input is detected in L4-ST neurons, an entire L2/3 minicolumn will be activated. Experimental evidence for this simple form of coincidence detection in L4-ST cells can be seen directly in their response properties. Input to L4-ST cells in V1 comes from first-order visual thalamus (LGN), which respond to on-center off-surround circular stimuli in specific locations in their receptive field. However, L4-ST neurons in V1 primarily respond to bars of light of specific orientations. This is exactly what would be expected if L4-ST neurons performed coincidence detection on their bottom-up input. A bar of light in a specific orientation is simply a coincidence of a specific set of on-center, off-surround circles.”

“ The pyramidal neurons found in L2/3 (‘‘L2/3-PY’’ neurons) have basal dendrites that extend laterally throughout the entire macrocolumn. They have apical dendrites that extend throughout L1 in the macrocolumn…..I propose the computation of individual L2/3-PY neurons is as described by the ‘‘HTM model neuron’’: basal dendrites receive ‘‘contextual’’ modulatory input from other L2/3-PY neurons, whereas apical dendrites receive ‘‘top-down’’ modulatory input from other macrocolumns and higher-order thalamus. Excitation of either apical or basal dendrites of L2/3-PY neurons does not provide sufficient depolarization to drive somatic depolarization. However, such subthreshold excitation can modulate the sensitivity of these neurons to L4-ST input.

“When other L2/3-PY neurons synapse directly onto L2/3- PY neuron dendrites, they provide excitatory contextual modulatory input. When they instead synapse first onto inhibitory interneurons, they provide inhibitory contextual modulatory input. I propose that this excitatory and inhibitory recurrent connectivity enables the L2/3-PY cell network to operate as a winner-take-all competitive network.

“Suppose a macrocolumn has learned two coincident patterns in L4-ST neurons—one pattern for ‘‘A’’ and one pattern for ‘‘B’’. This model proposes that the L4-ST neurons that respond to ‘‘A’’ will activate a set of minicolumns in L2/3, whereas the different pattern of L4-ST neurons that respond to ‘‘B,’’ will activate a different set of minicolumns in L2/3. I propose that the cells in a minicolumn active within ‘‘A’’ will provide excitatory input to neurons in other minicolumns also active in ‘‘A’’ while providing inhibitory input to neurons in minicolumns that are not active during ‘‘A’’ (such as those for ‘‘B’’). This effectively implements a competitive network, where cells responsive to ‘‘A’’ will excite other cells responsive to ‘‘A’’ while inhibiting those responsive to other stimuli.”

So to recap what Bennett is saying here, if a representation of “A” in L2/L3 consists of more than one column, then those columns excite each other. They also inhibit columns that are part of other representations.

Bennett also thinks this gives a mechanism for resolving inputs that could be interpreted multiple ways:

“This means that if ambiguous or conflicting coincidence detection occurs (i.e., both ‘‘A’’ and ‘‘B’’ are input into the network simultaneously), the competitive network in L2/3 will force only one representation to be active. Furthermore, top-down excitation enables higher cortical regions to bias L2/3 representation, allowing for patterns with less bottom-up input to still win. Note that top-down bias cannot create a representation if there is no bottom-up evidence at all, it can only bias representations. This is consistent with intuition—consider the famous duck or rabbit example. This image can be seen as either a duck or a rabbit, but you can’t see a unicorn. Top-down bias can shift network states between representations that have some bottom-up evidence but not to representations with no bottom-up evidence.”

The next figure shows this disambiguation idea. Part A (top) shows a representation for the letter “A” that consists of two columns. Part A (bottom) shows a representation of the letter “B” that also consists of two columns, but different ones than in “A”. Part B (top) shows that the two columns of the letter “A” excite each other. Part B (bottom) shows that the two columns of the letter “A” also inhibit the two columns of the letter “B”. Part C shows blue filled-in circles in layer L4, with the dots for the letter “A” being a darker color – indicating stronger evidence coming from the senses or from a lower level than the evidence shown by the dots for the letter “B”. “A” outcompetes “B”, and suppresses “B” in this case, unless, as shown in Part D, top-down information strengthens the L2/L3 representation of “B” enough that it wins the competition. If the letter “B” wins, it suppresses the L2/L3 columns for the letter “A”. L4 is not affected by this top-down disambiguation, but L2/L3 is.
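This competition can be sketched as a toy winner-take-all over candidate representations. The scoring scheme and numbers below are my own illustration, not Bennett’s; the key property carried over from the model is that top-down bias can tip the outcome but cannot create a winner with no bottom-up evidence:

```python
# Toy winner-take-all over candidate L2/3 representations. Scores and names
# are my own illustration; the key property from the model is that top-down
# bias can tip the competition but cannot win without bottom-up evidence.

def winner_take_all(bottom_up, top_down):
    candidates = {r: bottom_up[r] + top_down.get(r, 0.0)
                  for r in bottom_up if bottom_up[r] > 0}
    return max(candidates, key=candidates.get) if candidates else None

evidence = {"A": 0.8, "B": 0.5, "unicorn": 0.0}
print(winner_take_all(evidence, {}))               # 'A': wins on bottom-up evidence alone
print(winner_take_all(evidence, {"B": 0.6}))       # 'B': top-down bias tips the competition
print(winner_take_all(evidence, {"unicorn": 5.0})) # 'A': no bottom-up evidence, no unicorn
```

The third call is the duck-rabbit point: no amount of top-down bias conjures a representation the senses do not support at all.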

Bennett notes that a single L2/L3 pyramidal neuron’s axon branches widely in L5, providing input to many L5a-RS neurons throughout a macrocolumn. L5a-RS neurons send a massive projection back to L2/3 neurons, synapsing both on pyramidal neurons and inhibitory interneurons throughout the macrocolumn.

Suppose you input a rapid sequence of already-known patterns (e.g., A, B, and C) into a macrocolumn, with each element immediately following the previous one with no delay. The figure below shows a series of proposed steps to learn the sequence “A,B,C”. In the figure, two columns represent each letter, but that is an arbitrary number chosen just for explanatory purposes.
Step 1:
The letter “A” is input on L4 (the blue dark dots) and this creates a representation for “A” in L2/L3 (the green dots). The pattern in L2/L3 projects down to the L5a-RS neurons, and activates some of them. The assumption is that there is a unique pattern of L5a-RS neurons that fires for every possible pattern in L2/L3.
Step 2:
The pattern of L5a-RS neurons sends axons to L2/L3, biasing (but not firing) an arbitrary pattern of neurons in L2/L3.
Step 3:
An input of the letter “B” comes in via L4. This activates the pattern for the letter “B” in L2/L3. Due to the prior biasing, some of the neurons in the two columns shown for “B” are inhibited and others are biased to be excited, so the pattern in those two columns is unique for “B preceded by A”. (There is a slight error in the figure: the circles that have only outlines in L5a-RS should have been filled in.)
Step 4:
The unique pattern for “B preceded by A” activates a pattern in L5a-RS.
Step 5:
The pattern in L5a-RS biases various neurons in L2/L3. Then input “C” comes in via L4, which in turn fires two columns in L2/L3. Some of the neurons in those two columns are biased to fire, others are biased to be inhibited. So the pattern in these two columns represents “C preceded by A,B”.

In the figure Step 3 and Step 5 also have a note (underneath) that says that Hebbian plasticity (in this case spike timing dependent plasticity or STDP) occurs between the L5a-RS neurons and the sparse code that they biased. In other words, when “B” or “C” comes in, the L5a-RS neurons that biased the two columns also learn to fire those two columns in the same pattern that was created by the biases. This means that the sequence can be replayed and has been learned.

Bennett calls the process of biasing by L5a-RS neurons of L2/L3 neurons “sequence biasing”. Biasing is subthreshold activation and here the pattern produced by biasing (in this case in the two columns representing a sequence) is reflective of what came before in the sequence.
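Here is a toy simulation of sequence biasing, under the model’s assumption that every L2/3 pattern maps to a unique L5a-RS pattern. The deterministic `biased_cell` rule and all the sizes are invented for illustration:

```python
# Toy see-saw between L2/3 and L5a-RS while the sequence A,B,C streams in.
# Cells are (minicolumn, cell) pairs; all sizes and the biasing rule are
# invented for illustration.

CELLS_PER_COL = 4
COLUMNS = {"A": [0, 1], "B": [2, 3], "C": [4, 5]}   # two minicolumns per letter

def biased_cell(prev_code, col):
    # Deterministic stand-in for the assumption that every L2/3 pattern
    # activates a unique L5a-RS pattern, which then biases one cell per
    # minicolumn back in L2/3.
    return (sum(c for _, c in prev_code) + col) % CELLS_PER_COL

def run_sequence(letters):
    codes, prev = [], frozenset()
    for letter in letters:
        code = set()
        for col in COLUMNS[letter]:
            if prev:   # sequence context known: only the biased cell fires
                code.add((col, biased_cell(prev, col)))
            else:      # no context yet: every cell in the minicolumn fires
                code.update((col, c) for c in range(CELLS_PER_COL))
        codes.append(code)
        prev = frozenset(code)
    return codes

codes = run_sequence("ABC")
print([len(c) for c in codes])   # [8, 2, 2]: full columns first, then sparse context codes
```

The first letter activates its two full minicolumns (no context), while “B” and “C” come out as sparse codes shaped by what came before, which is exactly what “sequence biasing” is meant to achieve.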

A problem with learning the sequence “ABC” in the above network is that the patterns must be input rapidly, within the < 100 ms time window for this short-term synaptic potentiation (the Hebbian plasticity), which is not realistic. Later in Bennett’s article, he comes up with a mechanism that would make longer time scales work – for instance, “A” followed by a 5-second pause, then “B”, another 5-second pause, and then “C”. His approach has a limitation: a sequence “A->B->C” would not be distinguishable from “A->B->B->C” unless some other factor is in play.

We talked about L5a-RS neurons. What do L5b-IB neurons do?

Bennett proposes that L5b-IB neurons perform pattern separation on the L2/3 macrocolumn code, meaning that the L5b-IB code is sensitive to the sequence representation in L2/3, not just the column representation. He proposes that this “unique sequence code” is the core output code of a macrocolumn. It’s true that L5a-RS also has a pattern that is unique to a letter in a sequence, because as the sequence unfolds, activity see-saws between L2/L3 and L5a-RS, with sparse unique representations in L2/L3 leading to unique representations in L5a-RS. However, L5b-IB can learn to represent the entire learned sequence, even as it just starts to be input. For instance, when “A” is input, L5b-IB will output a code that stands for the learned sequence “A->B->C”. In the model, if “A” is also the start of another learned sequence, say “A->Q->W”, then that sequence can also be output at the same time from L5b-IB. This is one advantage of sparse patterns: sparsity allows several patterns to be expressed at once without the patterns sharing neurons and interfering with each other. Here is a diagram that shows L2/L3 sending signals that create a unique pattern in L5b-IB, which in turn passes it to higher-level macrocolumns via the thalamus. L5b-IB neurons also send the pattern to other areas.

Here is a diagram from Bennett that shows L5b-IB patterns for different sequences before learning:

After learning, the sequence code output is for the entire sequence A->B->C, not just the single-element sequence “A” or the two-element sequence “A->B”:
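A small sketch of why sparsity lets L5b-IB express several candidate sequence codes at once without interference (the cell indices and sizes are made up): the union of two small sparse codes still contains each code intact, so a downstream reader can recover both.

```python
# Why sparsity lets L5b-IB express several candidate sequences at once: the
# union of two small sparse codes still contains each code intact, so a
# downstream reader can recover both. Cell indices and sizes are made up.

seq_codes = {                          # hypothetical learned sequence codes
    "A->B->C": frozenset({3, 141, 592, 653}),
    "A->Q->W": frozenset({27, 182, 818, 845}),
}

# Input "A" is consistent with both learned sequences: output their union.
active = seq_codes["A->B->C"] | seq_codes["A->Q->W"]

# Each candidate is still readable by subset-matching against the union.
matches = [name for name, code in seq_codes.items() if code <= active]
print(sorted(matches))   # ['A->B->C', 'A->Q->W']
```

With dense codes, the two patterns would share many neurons and the union would be ambiguous; with sparse codes drawn from a large population, accidental overlap is unlikely.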

Many models of the neocortex see it as a predictor of what comes next. Bennett proposes that layer 6a corticothalamic (‘‘L6a-CT’’) neurons encode predictions of the upcoming stimuli a macrocolumn expects. The observed connectivity of L6a-CT neurons is consistent with this. L6a-CT neurons have apical dendrites in L4 (red in the diagram below), where they have access to direct input from core thalamic neurons. Dendritic NMDA spikes in these apical dendrites can learn the same patterns that L4-ST dendrites do. In the diagram, L5b-IB neurons also synapse on L6a-CT neurons. L6a-CT neurons combine this information with the information from L4 (which is the input from the senses or from lower macrocolumns in the cortex hierarchy) and use it to send predictions back to L4. Those predictions are modulatory – they bias neurons, but they do not fire neurons.

Bennett thinks that L6a-CT neurons, just like L2/L3 neurons, are connected in a ‘winner take all’ fashion. They can be put in a predictive state (in this case by their apical dendrites reaching into L4), and then the driving input from L5b-IB sets off columns in L6a-CT that become sparse and unique because of the information from L4. The two pieces of information are the sequence the macrocolumn is currently in (which comes from L5b-IB) and the input that just came into the macrocolumn (which comes from the apical dendrites that reach into L4). If the macrocolumn “believes” it is in sequence “X,Y,Z” and the dendrites from L4 convey that “Y” has just come in, then L6a-CT will predict “Z” and send modulatory signals to L4 to indicate that “Z” is expected.

Bennett puts it this way:

“Suppose a specific coincident pattern of input is received by a macrocolumn. This puts a specific pattern of L4-ST neurons into an active state, as well as putting a specific pattern of L6a-CT neurons into a predictive state. Furthermore, suppose a given L5b-IB sequence code sends driving input to a random subset of L6a-CT neurons. When the L5b-IB sequence code fires, only the predicted L6a-CT neurons receiving L5b-IB input will become active, the rest will be inactivated by lateral inhibition. This generates a sparse L6a-CT code that is unique to a specific element within a specific sequence.”

The predictions bias the neurons, and eventually, via Hebbian plasticity, learn to fire them.
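The proposed L6a-CT activation rule amounts to a set intersection: only cells that are both predicted (by apical input from L4) and driven (by the L5b-IB sequence code) fire. A minimal sketch with made-up cell indices:

```python
# The proposed L6a-CT rule as a set intersection: a cell fires only if it is
# both in a predictive state (apical input from L4) and driven by the active
# L5b-IB sequence code. Cell indices are made up.

def l6a_ct_active(predictive, l5b_driven):
    # Lateral inhibition silences everything outside the intersection.
    return predictive & l5b_driven

predictive = {1, 4, 7, 9}    # apical dendrites matched the current L4 input
driven = {4, 9, 13, 21}      # random subset targeted by the sequence code
print(sorted(l6a_ct_active(predictive, driven)))   # [4, 9]: sparse and sequence-specific
```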

Bennett adds

“The random L6a-CT pattern that gets activated by ‘‘B’’ in the sequence ‘‘ABC’’ will fire right before the core thalamic neurons and L4-ST neurons for ‘‘C,’’ hence building short-term Hebbian plasticity with both of these neurons. Hence if the sequence ‘‘ABC’’ is replayed a sufficient quantity of times, these sparse L6a-CT codes will build long-term plasticity with the core thalamic and L4-ST neurons that tend to follow them, hence reliably predicting the upcoming element in a learned sequence.”

In this figure from Bennett’s article, L6a-CT’s sparse pattern (step 2) sends predictions to L4 (step 3).

There is another use for predictions from L6a-CT.

Bennett writes:
“I propose that another function of the frontal projection to L6a-CT and L2/3-PY neurons in the sensory cortex is to enable top-down attention. I use the term ‘‘top-down attention’’ to refer to two abilities–the ability of a subject to toggle between different possible interpretations of ambiguous stimuli and the ability of a subject to search an environment for specific features or objects (e.g., ‘‘where’s waldo?’’).”

So Bennett thinks that attention is fundamentally a process of biased competition in a winner-take-all network.

He also has an explanation for working memory (whose definition includes memory for sequences, or lists).
He notes:

“A recent experimental study showed that L6a-CT neurons provide strong driving input to L5a-RS neurons, eliciting action potentials directly. If macrocolumns work as proposed here, then the activated L5a-RS neurons will activate L2/3-PY neurons. This means that if the frontal cortex triggers a specific L6a-CT representation (bloggers note: in the sensory cortex), then simultaneously it will trigger a corresponding L2/3-PY representation via L5a-RS neurons. In other words, a frontal projection to L6aCT can trigger and maintain L2/3-PY representations without sensory input.”

The actual working memory would be a sequence playing out in a macrocolumn via the ‘see saw’ between L5a-RS and L2/L3.

Many sequences would play simultaneously in many macrocolumns, so coordinating them in time would be important.

For working memory to play out in L2/L3, L4 would have to be temporarily turned off. (Recall that L4 gets inputs from the outside world, directly or indirectly, and in this model it would interfere with a replay of memory because it synapses on L2/L3 neurons.) Bennett explains how this might be accomplished, but before I discuss that, I should describe why the hippocampus appears in the above diagram.

Bennett writes:

“I propose that the hippocampus is an essential component of this process. CA1 within the hippocampus has been shown to replay place codes on the gamma rhythm during working memory tasks.”
(blogger’s interruption: gamma waves are high frequency (25–140 Hz), while theta is relatively low frequency (4–7 Hz), and alpha (8–12 Hz) is somewhat higher than theta. A place cell is a kind of pyramidal neuron within the hippocampus that becomes active when an animal enters a particular place in its environment, known as the place field. Place cells are thought, collectively, to act as a cognitive representation of a specific location in space.)

“CA1 of the hippocampus provides an extensive excitatory projection to the frontal cortex. If CA1 triggers replay in the frontal cortex, then the corresponding representations within the sensory cortex could also be replayed due to already described frontal projection to L6a-CT neurons.”

So one possibility here is that as a mammal moves through an environment, and the different place cells in the hippocampus fire, those place cells are telling each macrocolumn to remember an episode related to the various places it passes through.

One problem that has to be solved is coordination:

“Let us now turn to answer the question of how the brain coordinates processing across macrocolumns on precise timescales. Processing on precise time scales is an essential requirement for networks of macrocolumns. Postsynaptic excitation after presynaptic excitation across a single synapse, in the absence of successfully driving a postsynaptic spike, typically decays within 10–30 ms. This means that in order for dendritic segments to sum inputs across multiple synapses, presynaptic neurons must fire action potentials within a precise time window.

I propose processing on precise timescales is made possible by macrocolumns oscillating back and forth between an ‘‘input state’’ and an ‘‘output state.’’ The inherent circuit dynamics within the thalamus ensure that macrocolumns oscillate between these states at the same time, enabling coordinated processing. Within the thalamus, about 30% of thalamocortical cells have been called ‘‘High-Threshold Bursting Cells’’ (HTC) due to their rhythmic bursting at the alpha rhythm. When these HTC neurons burst fire they inhibit other thalamic relay neurons via thalamic interneurons. I speculate that these HTC cells are in fact the same as the multiareal matrix cells and the neurons they inhibit are core relay neurons. If this is true, then on the alpha rhythm, multiareal matrix neurons will fire for ~50 ms while core neurons pause, and then core neurons will fire for 50 ms while multiareal matrix neurons pause, back and forth.

I propose that when multiareal neurons pause and core thalamic neurons are activated, macrocolumns lock into an ‘‘input state.’’ In this state, macrocolumns integrate bottom-up input from core thalamic neurons through L4-ST neurons and top-down input through apical dendrites of L2/3-PY neurons. Activation of L4-ST neurons excites inhibitory interneurons in L5 which directly inhibit L5a-RS and L5b-IB neurons. Hence during input states, superficial layers are activated, and deep layers are inactivated.”

The green arrows in the diagram are inhibitory, so you can see that L4 suppresses the L5 layers from sending outputs.

Bennett continues:
“However, when multiareal matrix neurons burst fire and core relay neurons pause, the macrocolumn shifts to an ‘‘output state.’’ In this state, I propose that L2/3-PY and L4-ST neurons will be inhibited, while deep layer neurons will become activated. There are several ways in which this could happen…”

The blogger’s diagram below shows one way: The HTC neurons, no longer inhibited by the core neurons, send a projection to L1, where they synapse on L5b-IB neurons, whose axons fire L6a-CT neurons, which in turn inhibit L4 and the L2/L3 layer.
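The proposed alpha-rhythm alternation can be sketched as a simple toggle (the 100 ms period corresponds to the ~10 Hz alpha rhythm, and the half-and-half split to the model’s ~50 ms per state):

```python
# Toy timeline of the proposed alpha-rhythm toggle: core relay cells and
# multiareal matrix (HTC) cells alternate ~50 ms windows, putting every
# macrocolumn into input and output states in lockstep.

ALPHA_PERIOD_MS = 100   # one alpha cycle at ~10 Hz

def macrocolumn_state(t_ms):
    """First half-cycle: core relays fire, matrix cells pause -> input state.
    Second half-cycle: matrix cells burst, core relays pause -> output state."""
    return "input" if (t_ms % ALPHA_PERIOD_MS) < ALPHA_PERIOD_MS // 2 else "output"

print([macrocolumn_state(t) for t in range(0, 200, 25)])
# ['input', 'input', 'output', 'output', 'input', 'input', 'output', 'output']
```

Because every macrocolumn is driven by the same thalamic clock, they flip between states together, which is what makes coordinated hierarchical processing possible in the model.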

Why have an output state?

“First, the output state enables a stable output of the L5b-IB sequence code, so that it can be passed to other regions without being interrupted by changes in sensory input.
Second, the output state enables the macrocolumn to reactivate memories within L2/3-PY via L6a-CT neurons without being disrupted by incoming sensory information through L4-ST.
Third, it provides a mechanism for macrocolumns to ‘‘reset’’ their representations in concert, and hence enable a network to re-lock into a new representation given new information.”

Given that in the output state L2/3 is inactivated by inhibitory neurons from L6, you might ask how it can be used for working memories sparked by L6a-CT.
The answer is that this inhibition must be short-lived, serving the purpose of wiping any remnant L2/3 representations while an L5a-RS representation is activated, so that by the time the L5a-RS code gets to L2/3, the neurons there are all in a passive state, ready to activate a new representation (i.e., with no contextual lateral biasing from previous representations).

Bennett continues:

“I propose that there are two broad oscillatory modes of sensory thalamocortical networks: passive processing and attentive processing, each coordinating processing between different sets of regions at different frequencies.
I propose passive processing is the default thalamocortical network mode within the sensory cortex. In passive processing macrocolumns oscillate between input and output states at the alpha rhythm, spending roughly 50 ms in each state. …
However, I propose that during situations requiring top-down attention or working memory, thalamocortical networks slow down their oscillations to the theta frequency (about 100 ms in each state).”
“I propose that the purpose of this oscillatory slowing is threefold.
First, the default oscillatory dynamics of the higher-order frontal cortex and hippocampus are in the theta frequency; hence, to coordinate processing with those regions, sensory cortex needs to also oscillate at the same rhythm.
Second, this slowing down gives L2/3-PY neurons more time in between input states to replay sequences, hence enabling more items to be stored in working memory. Third, this slowing gives L2/3-PY neurons more time to lock into a representation that well matches top-down input and bottom-up input.
…I propose that during periods of a good match between top-down expectations and bottom-up input, L2/3-PY neurons resonate at gamma oscillations…As proposed by others, I hypothesize that the function of these rapid oscillations during successful predictions is to facilitate long-term synaptic plasticity to learn new associations of objects and sequences being attended to.”

In the third item above Bennett says that if the top-down information corroborates the bottom-up information, there is more input to the L2/3 pyramidal neurons, and their spiking oscillates at the gamma frequency, which also allows for learning. (In another part of his article, which I won’t discuss here, Bennett explains how mismatches between expected and actual inputs create a signal of “surprise”; that too involves a match (or lack of a match) between top-down signals and bottom-up signals, this time in the core neurons of the thalamus.)

In the second item above, Bennett is saying that the slowdown in frequency allows more time for working memory. This explains a finding known since 1956:

“Psychologist George Miller showed that the average human can only hold around seven items in short-term working memory at a given point in time. However, a neural circuit explanation for why we have this working memory limitation has been elusive. Lisman and Idiart made the novel observation that the two frequencies observed in EEGs during working memory tasks, theta and gamma oscillations, have a clear relationship with the ‘‘magic number 7’’: there are ~7 gamma oscillations within one half of a theta wave (~100 ms). They went on to propose that elements in working memory are replayed at the gamma frequency every theta cycle.
Consistent with their idea, I propose that the reason we have this limitation is that the thalamocortical networks provide a maximum of ~100 ms within an output …”

We’ll end this blog post with Bennett’s model of how a macrocolumn can learn a sequence over realistic timescales. For instance, learning a sequence of letters: “A”, “B”, “C” where there is a pause of 5 seconds between letters.

From the article:

“ In our model macrocolumn, let us represent these different elements (‘‘A,’’ ‘‘B,’’ ‘‘C’’) by the activation of different sets of two minicolumns (see Figure 3A). Computationally five specific states will occur during the example procedure of learning this sequence:
(1) Receiving the input of ‘‘A’’ for 1 s:
(2) Pause (no input) for 5 s:
(3) Receiving the input of ‘‘B’’ for 1 s:
(4) Pause (no input) for 5 s:
(5) Receiving the input of ‘‘C’’ for 1 s.”


In step 1, figure 13 from the article shows a theta wave in the second row. The down cycle coincides with the input state of a macrocolumn, and the up cycle coincides with the output state. The third row shows several steps, but note the arrows from the theta wave diagram in the middle to the steps on the bottom: more steps occur on the up cycle, so the steps do not line up visually with the wave. Bennett thinks that an ‘episode code’ from either the hippocampus or the frontal cortex creates a random pattern in L6a-CT. L6a-CT feeds into L5a-RS, where it activates another random pattern. At this point, though, “A” is input to L4, which creates a two-column representation for “A” in L2/L3. Short-term potentiation occurs here, so the random pattern in L5a-RS gets associated with the pattern in L2/L3. L2/L3 then activates a sequence representation in L5b-IB. Then the L2/L3 neurons all stop firing. Bennett explains this: “after being initially triggered from the prior activation of ‘‘A’’ in L2/3, the L5b-IB representation turns off any further L2/3 activation (by activating L6a-CT neurons, which then inhibit L4-ST neurons).”
So we see a step labeled “Output code for single element sequence ‘A’”. This code is coming from L5b-IB (the orange-colored row). Note that this code comes during the output state, which is in the up-cycle of the theta wave. Now the episode code, which is still present and coming from the frontal cortex or hippocampus, triggers the L5a-RS neurons again, creating the pattern it did before, which replays “A” in L2/L3. The L5b-IB neurons do not change; they are still firing the code for sequence “A”. The L5a-RS neurons bias the L2/L3 network in a characteristic way, so that if “A” continues to be input, it will create the same pattern in L2/L3.

Now what happens during the 5 second pause after “A”? Bennett explains:

Step 2:
“When the sensory input of ‘‘A’’ is removed during the pause, as long as the frontal/CA1 episode code continues to replay itself during this delay period, then the L6a-CT episode code will continue to independently replay ‘‘A’’ during each output state. Crucially, this means that at the beginning of each input state, L2/3 is sequence biased from the L5a-RS ‘‘A’’ representation, waiting to be mapped to the next incoming L2/3 representation. I propose that this continuous replay of an episode code is one of the key underlying computational processes performed by the brain during working memory tasks.”

I’ll use Bennett’s explanation here:

Step #3: Input “B” For 1 s
 After 5 s of a pause, the sensory input of ‘‘B’’ is provided to the macrocolumn. Due to the sequence biasing from L5a-RS neurons, a sparse representation of ‘‘B’’ is activated that is unique to ‘‘A –> B,’’ and the L5a-RS code is mapped to this sparse representation of B using STDP. Due to this unique representation of ‘‘B,’’ now L5b-IB neurons output an ‘‘A –> B’’ sequence code instead of just the ‘‘A’’ sequence code
Taken together, this means that although ‘‘A’’ and ‘‘B’’ were separated by 5 s, in the macrocolumn they were only separated by ~10 ms due to the repeated working memory replay of ‘‘A.’’ This enables rapid STDP plasticity between the L5a-RS neurons activated by ‘‘A’’ and the L2/3 representation of ‘‘B,’’ despite a 5-s separation between the actual sensory stimuli.
 When the repeating frontal/CA1 episode code comes around and reactivates ‘‘A’’ during the output state, the entire sequence ‘‘A –> B’’ will be replayed automatically, instead of just ‘‘A.’’

 Step #4: Pause For 5 s
 Due to the same dynamics described in step #2, as long as frontal cortex/CA1 continues to replay the same episode code, our model macrocolumn will continue to replay the sequence ‘‘A –> B’’ on each output state even when stimuli ‘‘B’’ is removed.
 The key difference between step #4 and step #2 is that now: (a) there are two elements replayed and hence two gamma cycles (A and then B); and (b) the output state now ends with a sequence bias from the L5a-RS code for ‘‘A –> B,’’ instead of the L5a-RS code for just ‘‘A.’’

Step #5: Input “C” For 1 s
 When ‘‘C’’ is finally inputted into the macrocolumn after the final 5-s interval, as in step #3, the sequence bias from the L5a-RS code for ‘‘A –> B’’ leads to a sparse representation of ‘‘C’’ that corresponds to the sequence ‘‘A –> B –> C’’ (see Figure 15). This builds plasticity between the L5a-RS code for ‘‘A –> B’’ and this sparse representation of ‘‘C.’’ Hence now when ‘‘A’’ is replayed during the output state, there will be 3 elements replayed (hence 3 gamma cycles): ‘‘A’’ then ‘‘B’’ then ‘‘C.’’
 During the output state, due to the L2/3 representation of ‘‘C’’ that is unique to ‘‘A –> B –> C,’’ the L5b-IB output code will now be a unique code that represents exactly the sequence ‘‘A –> B –> C.’’ This macrocolumn has accomplished something amazing—it is now outputting a unique sequence code for the sequence ‘‘A –> B –> C’’ even though the input elements were separated by long time intervals. And the only external computation required was a constant episode code from the frontal cortex and/or hippocampus to enable consistent replay of only the first element ‘‘A.’’
 Remembering the Sequence “ABC” After Just Saying “A”
 Each time the sequence ‘‘A’’ then ‘‘B’’ then ‘‘C’’ replays during an output state while L5b-IB neurons are firing the ‘‘A –> B –> C’’ output code, each representation of ‘‘A’’ then ‘‘A –> B’’ and then ‘‘A –> B –> C’’ builds plasticity with L5b-IB representation of ‘‘A –> B –> C’’ (since they coactivate with each other). If this replay occurs a sufficient quantity of times, these synaptic connections will go through long-term potentiation (LTP). This LTP then makes it such that when this macrocolumn receives the input ‘‘A,’’ during the output state it will output the code ‘‘A –> B –> C’’ automatically instead of just the output for sequence ‘‘A.’’ Note that multiple L5b-IB representations can be active simultaneously, meaning that if ‘‘A’’ leads to multiple different sequences, multiple ambiguous sequence codes can be output for higher cortical areas to disambiguate.
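The five steps can be caricatured in a few lines of code. This is a purely symbolic sketch of the replay-and-bind idea; the class and method names are invented for illustration, and all of the layer-level machinery (L5a-RS bias, L5b-IB codes, STDP) is collapsed into a single list append:

```python
# Symbolic sketch: an episode code keeps replaying the learned sequence
# on every output state, so a new element arriving seconds later is
# only ~one gamma cycle away from the replayed tail and can be bound
# to it despite the long real-time gap.

class Macrocolumn:
    def __init__(self):
        self.sequence = []  # learned chain, grows as elements arrive

    def output_state(self):
        """Replay the stored sequence (one gamma cycle per element)."""
        return list(self.sequence)

    def input_state(self, stimulus=None):
        """Bind a new stimulus to the tail of the replayed sequence."""
        if stimulus is not None:
            # Stand-in for sequence biasing + STDP: the new element is
            # appended as a representation unique to the whole prefix.
            self.sequence.append(stimulus)

col = Macrocolumn()
# "A", 5 s pause, "B", 5 s pause, "C" -- pauses carry no input, but the
# output state still replays whatever has been learned so far.
for stimulus in ["A", None, "B", None, "C"]:
    col.input_state(stimulus)
    replay = col.output_state()

print(replay)  # ['A', 'B', 'C'] -- a code for the full sequence
```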

One problem with the model is the following. The episode code activates L6a-CT neurons, which put a random pattern into L5a-RS. Now an input of “A” comes in, and the pattern in L5a-RS is associated with “A” in L2/L3. Suppose the L5a-RS pattern biases L2/3 while “A” is first being learned, so that “A” is stored as a sparse pattern. Could that bias put some of the “A” neurons in L2/3 into an inhibited state, so that it modifies the original pattern of “A” in L2/L3?

I emailed Max to ask him this, and he replied:

“The model would run into some problems if that occurred. For example, if I know 5 sequences starting with A, and an episode code creates a sparse representation of A, then the macrocolumn won’t be able to correctly predict what next elements might occur.

Admittedly however, I don’t specify a mechanism by which the first object attended to is not L5a-RS biased, while others are. But this is likely required for the model to work.

However, if episode codes get entrained to sensory cortex only in specific points in time (such as when I actively am attending to something), then this loss in predicting what will happen next may not be as severe. An interesting implication of this would be – if you are attending to something (hence entraining episode codes to sensory cortex), you could actually understand something less (since this biasing of a known object).”

Max’s theory unites a lot of unexplained phenomena, like what oscillations in the brain might be for, how they work together, and how episodic memories might be stored. Episodic memories are memories of episodes that happened to you, as opposed to ‘semantic memory’, which is memory of facts. There are exciting implications, such as the idea that place cells firing in a sequence in time could generate a memory of a traversal of a space with landmarks, or even a more abstract ‘train of thought’.

Max also created a diagram of the neocortex with the known connections between the different types of neurons, which is worth looking at for anyone who wishes to understand or model it. Again, the article is:
An Attempt at a Unified Theory of the Neocortical Microcircuit in Sensory Cortex by Max Bennett – Frontiers in Neural Circuits

How grid cells could code memories of episodes, and more, in the brain.

In my prior post, I wrote about Professor Michael Hasselmo’s book on grid cells in the brain, which create a type of GPS grid as we walk through a space.   But the main point of the book is that grid cells plus ‘place cells’ can be part of a circuit for storing episodic memory.
Professor Hasselmo gives a scenario of one of his work days where he parks his car in one of many spots in the college garage, and then walks down Cummington street, passes one of his students, and then continues to his office where he speaks with his wife, and so forth.   The memory of this set of episodes involves a trajectory through space.   It also involves a point of view – for instance, which direction on Cummington street was he walking and what was he looking at.   The point of view can be thought of as a directional component of a vector.   Memory also involves time – such as how fast he was walking down the street.   When he is sitting in a chair in his office and talks on the phone his position in space doesn’t change, but time does.
You and I normally remember in sequence, but we can relive one segment and then jump back in time to remember an earlier segment.
In his model, grid cells are the inputs to ‘place cells’, where place cells indicate your position in a space (grid cells fire at every point of an imaginary grid stretched across the space, but a place cell might fire only if, for example, you were in the center of a room or the corner of a farmer’s field).  You also have ‘head direction’ cells, which code for where your gaze is pointing.   If we combine head direction with speed information, we have velocity (speed plus direction, though one problem with this theory is that you could be walking sideways while your gaze is straight up).  We will assume that ‘head direction’ cells don’t just code direction, they also code speed, and therefore are really ‘velocity’ cells.   Hasselmo sometimes uses a more general term, “action”, to describe what ‘head direction’ cells do, as if they are responsible for your movement.
In his model you could have grid cell activations leading to place cell activations which in turn lead to velocity cell activation and then the velocity updates the grid cells, altering their frequency and relative phase, and that then leads in a loop back to place cells.   It takes a while for the information to propagate in each step. So you can have a simulation of you going through space, at the speed in which you originally traversed it.
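That loop can be sketched in one dimension. This is a drastic caricature (the lookup table stands in for the learned place-to-velocity weights, and a single number stands in for the grid-cell phase code; all names are invented):

```python
# Minimal 1-D caricature of Hasselmo's retrieval loop: place cells
# drive velocity ("head direction") cells via learned associations,
# velocity updates the grid-cell phase, and the updated grid code
# reactivates the next place -- replaying the trajectory at roughly
# the speed it was originally traversed.

learned_velocity = {0.0: 1.0, 1.0: 1.0, 2.0: -1.0}  # place -> velocity

def replay_trajectory(start, steps, dt=1.0):
    position = start          # stands in for the grid-cell phase code
    path = [position]
    for _ in range(steps):
        v = learned_velocity.get(position, 0.0)  # place -> velocity
        position = position + v * dt             # velocity -> grid phase
        path.append(position)                    # grid -> next place
    return path

# Replaying from the start retraces the stored path, including the
# turn-around learned at position 2.0.
print(replay_trajectory(0.0, 3))  # [0.0, 1.0, 2.0, 1.0]
```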
Let’s examine this in more depth:
The ‘velocity’ cells are driven by your senses that tell you how fast and what direction you are going.
The velocity, in turn, causes frequency differences between two cells that normally fire at the same frequency, and this also implies that those cells will probably be out of phase when the creature stops moving.
Grid cells differ in their response to velocity.   Some have a frequency that changes only gradually with velocity, others have a more dramatic dependence on velocity.   Grid cells can also start at different initial phases.   The result is that each grid cell corresponds to a different grid – maybe the spacing between the gridlines is different, or the difference is in the orientation in which the grid overlays space, or both.
When you learn a trajectory, your movements and behavior drive head direction cells, which then drive grid cells which in turn drive place cells which have synapses on the original head direction cells.  Learning the trajectory involves strengthening some of the links between the place cells and the head direction cells.
In contrast, when you remember a trajectory, the main driver of the head direction cells is not sensory inputs and not behavior; rather, it is the place cells.   The cue for retrieval could be your current location as coded by place cells, or it could be environmental stimuli that were previously associated with a particular pattern of place cells.   For instance, a sight of the Art Museum in New York’s Central Park could send signals to a particular place-cell vector. The velocity cells have to fire in recollection just as long as they did in the real event, because their firing at the correct rate for the correct amount of time creates the phase differences in the grid cells that existed in the original real-life scenario.
You are able to distinguish actually having an experience from remembering it, and Prof. Hasselmo speculates that the different drivers of the head direction cells in the two cases give you a way of knowing the difference.
Professor Hasselmo cautions that this mechanism is not the only possible one, for instance, you could have time-interval cells that fire at steady intervals and independently fire cells for place and for action.
If you model this mechanism with a neural net, you would have one weight matrix from place cells to head direction cells (WHP) and another matrix between the grid cells and the place cells (WPG).
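A minimal sketch of those two matrices (the sizes and random values are arbitrary illustrations; WHP and WPG are the only names taken from the text):

```python
import numpy as np

# Sketch of the two weight matrices in the text: WPG maps grid-cell
# activity to place cells, and WHP maps place-cell activity to
# head-direction (velocity) cells. Learning a trajectory would mean
# strengthening entries of WHP; here we just show the forward pass.

n_grid, n_place, n_head = 6, 4, 3
rng = np.random.default_rng(0)

WPG = rng.random((n_place, n_grid))   # grid -> place weights
WHP = rng.random((n_head, n_place))   # place -> head-direction weights

grid_activity = rng.random(n_grid)
place_activity = WPG @ grid_activity  # place cells driven by grid cells
head_activity = WHP @ place_activity  # head-direction driven by place

print(place_activity.shape, head_activity.shape)  # (4,) (3,)
```

During recall, the head-direction activity would feed back to update the grid phases, closing the loop described above.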


There are objections to chaining models, based on experimental data showing that participants can retrieve the end of a sequence after missing one or more items, or can retrieve the items of a sequence in the wrong order.   Hasselmo gives alternatives, where for instance a cue activates time cells, or, as the creature moves, it activates ‘arc length’ cells.   The latter measure one-dimensional distance, and are useful, oddly enough, precisely because they are missing the direction information.   Imagine you are riding a bicycle with a device that measures your distance from the start of your ride.   You also have a list of directions.   One direction says, “at 5 miles, turn left on Magpie Road.”   The route then makes a loop and comes back to the same point, at which point your set of directions says, “at 10 miles, don’t repeat your loop on Magpie Road, but go up to Crawfish Hill Road.”   Since you have been keeping track of your distance, you know what to do.   If you just had two-dimensional information, such as where you currently are on a map, and no memory other than that, you would not know which road to take.   You need to know how far you’ve gone so far.   That is what the firing of arc-length cells would tell you.  ‘Place’ cells alone could not tell you which ‘head direction’ cell should fire next, in the case of a loop like this.
Another alternative might be ‘time cells’ (“after 1 hour of biking, take Magpie Road; after two hours, take Crawfish Hill Road”).
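The bicycle example amounts to keying your next action on arc length (or time) rather than on map position alone. A tiny sketch (the road names come from the example above; everything else is invented):

```python
# At the same map location, the correct action differs; only the
# distance travelled so far -- the arc-length cell code -- resolves
# the ambiguity at the loop.

directions = {  # (location, miles_so_far) -> action
    ("Magpie junction", 5): "turn left on Magpie Road",
    ("Magpie junction", 10): "go up Crawfish Hill Road",
}

def next_action(location, miles):
    # Place cells alone would only supply `location`; the arc-length
    # code supplies `miles`, which disambiguates the two visits.
    return directions[(location, miles)]

print(next_action("Magpie junction", 5))   # turn left on Magpie Road
print(next_action("Magpie junction", 10))  # go up Crawfish Hill Road
```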

In the prior examples, the phase differences that make grid cells fire are driven by arc-length cells or ‘time cells’, and not head-direction (velocity) cells.   (A velocity cell minus direction information is essentially an arc-length cell.)

In the example of remembering which parking spot you parked your car in at the garage next to your workplace, you don’t want yesterday’s parking spot to interfere with today’s.    As you walk into the garage, a trajectory would be activated, and perhaps at ambiguous points the arc-length cells or the time cells would lead you in the correct direction.   The cue that resets the arc-length cells has to be some difference between today and yesterday.
To remember what you saw at different parts of your day – walking away from your car, and then along the street, and then into your building, and who you talked to, there would be a learned association between specific ‘place cell vectors’ and the sensory patterns that you experienced.   One advantage of this two-way association is that remembering a particular sensory cue can activate the place cells that were firing when you first were at that spot and the memory sequence could start in mid-trajectory.
If  you see the same room only at rare intervals, it has been found that grid cells and place cells show the same (stable) firing each time.   In the model, this requires sensory cues to set the phase of firing of grid cells to the same starting point.
The associations for this type of memory requires associations between the code for space and time with the coding of actions, items, or events. The code for space and time comes in the form of place cells, arc length cells, and time cells.   These are associated with actions in the form of speed cells and head direction cells.  The model also uses bidirectional associations between the code for space and time with cells coding features of individual events.   So a trajectory can cue the retrieval of an event (remembering what happened when you opened the door to the tiger cage in the zoo) or conversely, seeing a picture of a tiger can remind you of your quick trajectory out of the tiger cage and out of the Zoo.   In addition, an association of one item can lead to a trajectory of other items and events.


We could start speculating here.   Any “train of thought” is a sequence where one item leads to another.   Could grid cells and place cells be involved?   Hasselmo also has a chapter on goal directed behavior where place cells propagate signals along a path back from a goal, which meet grid cells signals propagating forward.   This sounds like problem solving – not just remembering a path.
When we look at any part of a room, we are focused on only a small part of the room – at any moment, we assume the rest of the room is as we recently saw it.   Perhaps our experience of a room is a trajectory around the room associated at various positions with objects such as book shelves, chests of drawers, and lamps.  In fact, researchers at Numenta, a company that attempts to understand the cortex, hypothesize that every object in that room is itself represented by some type of grid cell trajectory, and that these grid cells are in every column of the cortex.   They also believe that objects are ‘recursive’: so, for instance, when you look at a cup with a handle, the handle itself has its own grid cell trajectory.
How We Remember – Michael Hasselmo (2012 – MIT)

Representing Space and Time by Neural Phase Shifts

An interesting way to represent information in neurons is explained in the book “How We Remember” by Michael Hasselmo. We have a type of GPS system – a grid that appears in our head whenever we go into a room, for instance. In fact, we have multiple spatial grids that vary in their spacing and in their orientation relative to the walls of the room.
To simplify, let’s assume there is just one neuron per grid. As an additional simplification, think of the grid as a series of vertical lines intersected by horizontal lines. Each intersection is a vertex. As you move through the room, every time you cross a vertex, the neuron will fire. For example, if the room has a square shape, then if you moved diagonally across the room, assuming the grid origin is at a corner of the room, you would intersect vertices as you crossed each square of the grid diagonally. This might not seem all that helpful in pinpointing your position, since the neuron fires at many places in the room – in other words, at every vertex of the grid.
In the top half of Figure 1 above, you see on the left the areas in space that fire one single grid cell.   All grid cells in a particular module have the same orientation and scale (scale means distance between areas that it will fire at).  At the top right, you see the firing areas of 2 cells in a module (green for one, blue for another).   The problem is that even though the cells are somewhat displaced from each other, the patterns repeat, and so if the two cells fire, you still can’t tell where in the room the creature is.   You can rule out some areas (the white areas), but the result is still ambiguous.
The saving grace is that you have several grids with different scales / spacing, and different orientations. If three different neurons, each for a different grid with different properties, fire at the same time the ambiguity in position is reduced.   At some point the combinations do repeat, but it will take much more space for that to happen. (see bottom half of figure 1)
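A one-dimensional sketch shows why several scales disambiguate (the spacings here are invented for illustration):

```python
# One grid cell firing at every multiple of its spacing is ambiguous,
# but the joint firing pattern of cells with different spacings repeats
# only at the least common multiple of the spacings.

def fires(position, spacing):
    return position % spacing == 0

spacings = [3, 4, 5]  # three grid modules with different scales

def code(position):
    """The combined on/off firing pattern across the three modules."""
    return tuple(fires(position, s) for s in spacings)

# The all-cells-fire pattern recurs only every lcm(3, 4, 5) = 60 units,
# so within a room shorter than 60 units the code is unambiguous.
positions = [p for p in range(1, 100) if code(p) == (True, True, True)]
print(positions)  # [60]
```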
To simplify, we will assume that there is just one dimension. (Your room has no width, only length.) Start with two cells that fire at the same frequency and the same phase. If a third cell fires when the spikes of the two cells impinge on it within a narrow time frame, then not surprisingly it will fire when they do. Now let’s suppose the two neurons get out of phase. In that case the third cell will not fire (if they are sufficiently out of phase).
Suppose that one of the neurons fires faster when you walk through the room. (The neuron that doesn’t change is called the ‘baseline’ neuron.) The neuron that does change speeds up its spike rate the faster you walk. Not surprisingly, the baseline neuron will no longer be in phase with the speedup neuron. The third neuron, which detects the coincidence of the other two, will not fire during this period.


You can see the two neurons (B and C) start in phase, but neuron B speeds up its frequency for a short time, and then is in no longer in phase with neuron C (C is the baseline)

If you keep moving, at some point the spikes of the neurons are in phase again. You can think of this as two runners traveling around a circular track at different speeds. The faster runner catches up with the slower one, at which point their positions coincide, but then he runs past and the positions no longer coincide (until he overtakes again).

The first period of movement put the neurons out of phase, but the second period of movement was just long enough to get them into phase again.

So the third neuron (which is active when the spikes from the other 2 coincide) fires at regular intervals as you cross the room, if you cross the room at a steady speed.
This works at any speed. If you walk slowly then the non-baseline neuron is still firing faster than the baseline neuron, but not as much faster as in the prior case. So its frequency increment vs baseline is slower, and it takes more time for the two neurons to coincide.
The baseline neuron and the speedup neuron will still fire a spike simultaneously at the vertex, because the extra time it takes for the peaks to coincide matches the extra time for you to reach that point in the room at your slower pace. So the grid, as expressed by the firing of a neuron, is not altered by how fast you move between vertices.

Moving quickly (upper example) vs moving slowly (lower example).  The frequency of the speedup neuron is faster in the first example, so the neurons get back in phase quicker.
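The speed-invariance argument can be made quantitative. A sketch, assuming the speedup neuron's extra frequency is proportional to running speed (the gain `beta`, in cycles of extra phase per metre, is an invented parameter):

```python
# Oscillatory-interference sketch: the phase lag between the baseline
# and the speed-modulated neuron depends only on distance travelled,
# not on speed -- so the grid spacing is speed-invariant.

def phase_lag_cycles(speed, duration, beta=0.5):
    """Return (accumulated phase lag in cycles, distance travelled)."""
    distance = speed * duration
    extra_freq = beta * speed      # Hz added on top of the baseline
    return extra_freq * duration, distance

# Fast crossing: 2 m/s for 1 s.  Slow crossing: 0.5 m/s for 4 s.
for speed, duration in [(2.0, 1.0), (0.5, 4.0)]:
    lag, dist = phase_lag_cycles(speed, duration)
    print(dist, lag)  # both travel 2 m and accumulate 1.0 cycle of lag

# The neurons come back into phase after exactly one full cycle of lag,
# i.e. every 1/beta = 2 m of travel, whatever the running speed. That
# distance is the grid spacing.
```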

The above example was for one dimension, and required three neurons. Suppose we want to extend the model to two dimensions. To do that we add another neuron. Its normal frequency is the same as the others, but like the other non-baseline neuron, it also fires faster when you move, but only when you move south. As the angle deviates from due south, it fires less strongly (it falls off as the cosine of the angle). The other original non-baseline neuron will fire strongly as you go due east, and its firing will also fall off, as the cosine of the deviation from due east. (In practice these neurons don’t fire much at all after a deviation of 45 degrees off their preferred direction.)
Let’s assume that the summing neuron now sums up the baseline neuron and the two directional neurons, and will only fire when they are all close to their peaks at the same time. In that case, it will fire at the vertices of a two-dimensional grid. This summing neuron, which fires when the spikes of all its inputs coincide, is a grid cell.
To make several grid cells with different spacing, we could repeat several groups of four neurons, each group with a different innate base frequency. For example, a group with a high-frequency baseline results in its three input neurons firing together at vertices that are spatially closer; the summing neuron would correspond to a grid with small spacing between the vertices.
In the brain, the grids are hexagonal, or you can think of them as tiled triangles, so instead of having a neuron that fires when you go due south and another when you go due east, you would have three neurons that fire maximally at directions 120 degrees apart (3 * 120 = 360 – a complete revolution in direction back to the starting point).
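Here is a minimal sketch of that 2-D extension: three direction-tuned oscillators with preferred directions 120 degrees apart, each contributing a cosine "band" across space, summed by a grid cell that peaks where all three bands peak together (the spacing parameter and function names are invented illustrations):

```python
import math

# Three cosine bands at preferred directions 0, 120, and 240 degrees.
# Their sum peaks on a hexagonal (tiled-triangle) lattice of vertices,
# which is the firing pattern of a grid cell.

def grid_cell_activity(x, y, spacing=1.0):
    k = 2 * math.pi / spacing  # spatial frequency of each band
    total = 0.0
    for angle_deg in (0, 120, 240):
        a = math.radians(angle_deg)
        # Projection of position onto this oscillator's preferred
        # direction (the cosine directional falloff from the text).
        total += math.cos(k * (x * math.cos(a) + y * math.sin(a)))
    return total / 3.0  # equals 1.0 only where all three bands peak

# The origin is a lattice vertex: all three bands peak there.
print(round(grid_cell_activity(0.0, 0.0), 3))  # 1.0
```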

We described intervals for space, but you can use neurons for time intervals as well.  If you have two neurons firing at the same frequency but with different phases, they will coincide at regular time intervals.   If you add a third neuron firing out of phase with the other two, then you have the coincidence of all three neurons repeating at a longer time interval.

If you have several grid cells, and they don’t have the same spacing, then, as I said earlier, you can disambiguate your location (one grid cell alone could mean you were at any vertex of the imaginary GPS grid overlaying the room).   This next illustration shows the one-dimensional example again.   Each grid cell has a different spacing (the top one has the widest), and the sum of the top three grid cells peaks in only one place on the x-axis.   It is true that eventually there will be another peak, but if your room is small enough, you will run into a wall before that happens. You can always add cells that repeat at even larger spacings.


The next image shows the two dimensional case. When the 120 degree and the 240 degree and the 360 degree neurons coincide, you have the intersections of the 3 angled lines in the image.


The hippocampus has “place cells” that fire when you are at a particular place in a room. It is possible to model such cells by taking three grid cells at random as inputs to the place cell.  Prof. Hasselmo has an algorithm where the connections from those grid cells to the place cell are strengthened if they correspond to only one location, but I must confess that I did not understand his method from the text.

His model allows place cells of different ranges, so you could have a place cell that fires only when you are walking under the lamp in a room, and another that fires when you are walking anywhere in the town square but changes when you walk out of it.

Michael Hasselmo also has a theory of how we remember, based on grid cells and place cells. I will describe his theory when I write the next post.   Numenta, which is a company in California that is attempting to model the cortex, also has a theory that the entire cortex uses a grid cell mechanism for general thinking.   I’ll describe that theory in another post as well.

How We Remember – Brain Mechanisms of Episodic Memory – by Michael E Hasselmo (2012 MIT Press)

Eric Kandel tells us what unusual brains tell us about ourselves in his new book.

Prof. Kandel

Eric Kandel wrote the standard textbook of neural science (a field in which he was a pioneer) and just published a book titled The Disordered Mind – What Unusual Brains Tell Us About Ourselves. I will just select a few interesting points from the book about the prefrontal cortex, since that is a part of the brain that gives us all sorts of desirable characteristics, such as will-power, concentration, decision making, judgment, and planning for the future. The prefrontal cortex has neurons that are key to working memory too. Plus it is important in what we call the moral emotions: indignation, compassion, shame, and embarrassment.

If you are under a lot of stress, your adrenal gland releases cortisol, which heightens vigilance. Unfortunately, over time it destroys synaptic connections in your prefrontal cortex. The mental disease of depression also causes a flood of cortisol, and interestingly, various drugs that treat depression, such as imipramine, iproniazid, and ketamine, increase the number of synapses in the prefrontal cortex.
Ketamine works faster than traditional antidepressants, so it is prescribed for the first two weeks, until the other antidepressants take effect; the idea is to prevent suicide during that window. It would be used for more than two weeks, but the problem is that it has side effects. Ketamine works faster because it latches onto receptors on the target cell and keeps an excitatory neurotransmitter (glutamate) from occupying those receptors and over-exciting the neuron. This is a direct effect. Other drugs usually target serotonin, whose effects are indirect: they fine-tune the action of other neurotransmitters, whether excitatory or inhibitory, and the effects are slower.
(Ketamine is not always used wisely, in fact, drug dealers sell it and their customers often use it along with other drugs such as Ecstasy or cocaine or sprinkle it on marijuana blunts.  You can’t save people from themselves.   But I digress.)
A really interesting fact is that children have many more synapses than adults. Beginning about puberty, synaptic pruning removes the dendritic spines that the brain isn’t using, including spines that are not actually helping working memory. (Each incoming neural dendrite has spines on it, and those spines are where other neurons connect.) In Schizophrenia, synaptic pruning appears to go haywire during adolescence, snipping off far too many dendritic spines. So if you are a parent who has Schizophrenia in the family tree, you can’t breathe easy about the risk to your children until they get past adolescence.
There is actually a gene named C4 that produces a protein involved in tagging synapses for removal, and a variant of it named C4-A facilitates over-aggressive pruning.

Kandel’s subtitle for the book indicates that he is interested in what the various abnormalities he describes (including autism, Alzheimer’s, and gender behavior that doesn’t match external appearance) reveal about those of us without brain disorders.

One question raised in my mind was sparked by the section on Bipolar disorder. Bipolar disorder is characterized by extreme changes in mood, thought, and energy. Manic episodes include racing thoughts and decreased need for sleep. These episodes can be associated with high-risk behaviors such as excessive spending. Patients may get in trouble with the law. Later, a depressive episode will occur.
Kandel writes: “Once the first manic episode is initiated—usually at the age of seventeen or eighteen–the brain is changed in ways we do not yet understand, such that even minor events can trigger a later manic episode.”
The question that occurred to me is – are the rest of us feeling the appropriate mood for our situation? Are we too much into optimism and risk taking? Are we too depressed and pessimistic? What is the right balance?

As you get old, you may develop Frontotemporal Dementia, which begins in the frontal lobe. Your moral reasoning degenerates, and you may end up being arrested for acts such as shoplifting. You may spend yourself into bankruptcy, or regularly overeat. So let’s not be too judgmental of old people who start acting in an anti-social way.

In psychopaths, there is more gray matter (cell bodies) in and around the limbic system (which is involved in emotion), but the neural circuitry that connects the emotional areas to the frontal lobes is disrupted. I don’t know what this really means, but maybe we should not be too judgmental about psychopaths either!
More than 3 million Americans have bipolar disorder. About 20 million Americans suffer from major depression. Much suffering is caused by disorders of the brain. There is hope for  these disorders (if they are not caused by early miswiring in development, as Schizophrenia often is). Scientists are already finding genes involved in these diseases, and relevant proteins that are either overactive, or under-active, or malformed in some way. Perhaps as well some day researchers will find a way to make neurons divide and fill wounded areas.

Kandel’s book gives you a different way of looking at human behavior – and is worth reading.

How we use coherence of concepts to build ideologies and make sense of our world.

Much of human cognition can be thought of as ‘constraint satisfaction’, according to philosopher Paul Thagard. For example, think of applying to a university. One college is in a beautiful setting, but another college has a professor who is an expert in your desired major. The first college is in a quaint town with a low crime rate, the second one is in a city with a high crime rate. You have a scholarship to the first college, but the second college charges less for tuition. And so forth.
Or suppose you are a detective in a murder case where the prime suspect is the daughter of the victim, a rich industrialist. The daughter was in line to inherit the family fortune. You interview the daughter, and find out she dedicates her spare time to helping the needy. Then you find out that her boyfriend is a fellow she rescued from jail. So again, there is information that leads you in conflicting directions.
One way to manage all the conflicts (or even just priorities) is by constraint satisfaction.

The following is a diagram of a simple situation. You are thinking of hiring a local carpenter named Karl, but you need to know whether you can trust him alone in your house. You know he’s a gypsy, and that gypsy culture has allowed thievery from outsiders. That knowledge would push you in one direction. But then you hear that he returned a lost wallet to your neighbor. That pushes you in another direction. (I scanned the next figure, which illustrates the Karl scenario, from a small paperback, so the orientation is disturbing my coherence, but here it is.)


The dotted lines are inhibitory, and connect incompatible nodes or hypotheses. The normal lines are excitatory. All connections are bidirectional – so that if node-A reinforces node-B, then node-B reinforces node-A also.

In this picture, the hypothesis of being honest is incompatible with being dishonest, so there is a dotted line between them. The action of returning the wallet is compatible – in fact is evidence for – honesty, and so there is a full line – an excitatory connection between them.

But decisions aren’t just made based on evidence, there is often an emotional component. Another diagram, a cognitive affective map, can show the influence of emotions:


Ovals are used for positive valences (a positive emotion), so in this example the oval around ‘food’ indicates that food is a desirable concept. Hexagons have negative valences (so the shape used in the diagram for hunger is a hexagon). Rectangles are neutral – you are not pro- or anti-broccoli in this example.

The diagrams can apply to political attitudes. For instance, in Canada, the law says you should refer to ‘trans’ people by their preferred pronoun (which might be neither ‘he’ nor ‘she’). Some Canadian libertarians, notably Jordan Peterson, have objected to this. Here are two diagrams from a 2018 article by Paul Thagard showing how a liberal, for whom equality is a paramount value, might look at the issue, versus how a libertarian might look at the issue.


The green ovals with the strong borders show what the liberal prioritizes (equality) versus what the libertarian prizes (freedom). In the lower diagram, the libertarian considers freedom somewhat incompatible with regulation and with taxation, but compatible with private property and economic development. As a libertarian, you may take it as inevitable that economic development will result in income inequality, which is why the desirable value of ‘economic development’ has an inhibitory link with ‘income equality’ in the second diagram. As a liberal, prioritizing equality, you might focus on the positive links between capitalism and the negative nodes of ‘exploitation’ and ‘inequality’; so even though there is a positive link between ‘capitalism’ and ‘freedom’ in the first diagram, you might, after the various constraints interact and settle on a solution, want to modify capitalism.

One way of learning about an opponent’s perspective is to draw the diagrams of how you believe your opponent thinks – and then have him critique them and redraw them.

One advantage of such diagrams is that you can use an iterative (repetitive) process to spread the activations and find out, after the dust settles, which nodes are strongly activated.

You start by assigning activations to each node. We can assign all of the nodes an initial activation of .01, for example, except for ‘evidence’ nodes, which are clamped at the maximum value (activations range from a minimum of -1 to a maximum of 1). Evidence might be an experimental finding, an item in the newspaper, or an experience you had.

The next step is to construct a symmetric excitatory link for every positive constraint between two nodes (they are compatible) . For every negative constraint, construct a symmetric inhibitory link.

Then update every node’s activation based on the weights on links to other units, the activations of those other units, and the current activation of the node itself. Here is an equation to do that:

a_j(t+1) = a_j(t)(1 − d) + net_j(max − a_j(t))   if net_j > 0
a_j(t+1) = a_j(t)(1 − d) + net_j(a_j(t) − min)   otherwise

Here d is a decay parameter (say 0.05) that decrements each unit at every cycle, min is the minimum activation (-1), and max is the maximum (1). net_j is the net input to unit j: the sum, over the units it links to, of the link weight times that unit’s activation.

The net updates for several cycles, and after enough cycles have occurred, we can say that all nodes with an activation above a certain threshold are accepted. You could end up with the net telling you to go to that urban college, or the net telling you that the daughter of the industrialist is innocent, or that a diagnosis of Lyme disease is unwarranted, or that you should not trust Karl.
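The whole settling procedure can be sketched in a few lines of Python for the Karl example. The node set, link weights, and parameter values below are my own illustrative choices, not Thagard's published ECHO values:

```python
import numpy as np

# The Karl network. The nodes, weights, and parameters here are my own
# illustrative choices, not Thagard's published ECHO values.
nodes = ["honest", "dishonest", "returned-wallet", "gypsy-stereotype"]
n = len(nodes)
W = np.zeros((n, n))

def link(a, b, w):                     # all links are symmetric
    i, j = nodes.index(a), nodes.index(b)
    W[i, j] = W[j, i] = w

link("honest", "dishonest", -0.2)      # incompatible hypotheses inhibit
link("honest", "returned-wallet", 0.1) # evidence for honesty excites
link("dishonest", "gypsy-stereotype", 0.1)

d, lo, hi = 0.05, -1.0, 1.0            # decay, min and max activation
a = np.full(n, 0.01)                   # initial activations
a[nodes.index("returned-wallet")] = hi # evidence clamped at the maximum

for _ in range(200):                   # let the net settle
    net = W @ a
    grow = np.where(net > 0, hi - a, a - lo)
    a = np.clip(a * (1 - d) + net * grow, lo, hi)
    a[nodes.index("returned-wallet")] = hi

for name, act in zip(nodes, a):
    print(f"{name:18s} {act:+.3f}")
```

With these particular weights, 'honest' settles well above threshold and 'dishonest' is driven negative, so this toy net tells you to trust Karl; swapping the relative weights of the evidence links would flip the verdict.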

There are several types of coherence, and they often interact. Professor Thagard gives an example:

In 1997 my wife and I needed to find someone to drive our six-year-old son, Adam, from morning kindergarten to afternoon day care. One solution recommended to us was to send him by taxi every day, but our mental associations for taxi drivers, largely shaped by some bizarre experiences in New York City, put a very negative emotional appraisal on this option. We did not feel that we could trust an unknown taxi driver, even though I have several times trusted perfectly nice Waterloo taxi drivers to drive me around town.
So I asked around my department to see if there were any graduate students who might be interested in a part-time job. The department secretary suggested a student, Christine, who was looking for work, and I arranged an interview with her. Very quickly, I felt that Christine was someone whom I could trust with Adam. She was intelligent, enthusiastic, interested in children, and motivated to be reliable, and she reminded me of a good baby-sitter, Jennifer, who had worked for us some years before. My wife also met her and had a similar reaction. Explanatory, conceptual, and analogical coherence all supported a positive emotional appraisal, as shown in this figure:


Conceptual coherence encouraged such inferences as from smiles to friendly, from articulate to intelligent, and from philosophy graduate student to responsible. Explanatory coherence evaluated competing explanations of why she says she likes children, comparing the hypothesis that she is a friendly person who really does like kids with the hypothesis that she has sinister motives for wanting the job. Finally, analogical coherence enters the picture because of her similarity with our former baby-sitter Jennifer with respect to enthusiasm and similar dimensions. A fuller version of the figure would show the features of Jennifer that were transferred analogically to Christine, along with the positive valence associated with Jennifer.

If we leave out ’emotion’ then we just spread activations and compute new ones. To include emotions, we assign a ‘valence’ (positive or negative) to the nodes as well, and those valences are like the activations, in that they can spread over links, but with a difference – their spread is partly dependent on the activation spread.

Take a look at this diagram:


There is now a valence node at the top, which sends positive valence to ‘honest’ and negative valence to ‘dishonest’. When the net is run, first the Karl node is activated, which then passes activation to the two facts about him: that he is a gypsy, and that he returned a wallet. If ‘honest’ ends up with a large activation, then it will spread its positive valence to ‘returned wallet’ and then to Karl.

The equation for updating valences is just like the one for updating activations, except that each neighbor’s contribution is also multiplied by that neighbor’s valence.
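A sketch of one valence-update step, under my reading that each neighbor contributes weight times activation times valence (treat this as illustrative; the exact equation is in Thagard's papers):

```python
import numpy as np

def valence_step(v, a, W, d=0.05, lo=-1.0, hi=1.0):
    """One valence update, assuming (my reading, not a quoted equation)
    that each neighbor contributes weight * activation * valence."""
    net = W @ (a * v)
    grow = np.where(net > 0, hi - v, v - lo)
    return np.clip(v * (1 - d) + net * grow, lo, hi)

# Two linked, fully active nodes; node 0 starts with positive valence.
W = np.array([[0.0, 0.5], [0.5, 0.0]])
a = np.array([1.0, 1.0])
v = np.array([1.0, 0.0])
v = valence_step(v, a, W)
print(v)      # node 1 has picked up positive valence from node 0
```

The multiplication by activation is what makes valence spread depend on the activation spread: a node that ends up rejected (activation near zero or negative) passes little or no valence to its neighbors.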

Some interesting ideas emerge from this. One is the concept of ‘meta-coherence’. You could get a result with a high positive valence, but if it is just above threshold, you are not sure of it, which could cause you distress. You might have to make a momentous decision that you can’t fully be confident is the right one.
Another emotion, surprise, could result from many nodes switching from accepted to rejected or vice versa as the cycles progress. You may find that you had to revise many assumptions.
Humor is often based on a joke leading you toward one interpretation, and then ending up with a different one at the punch line. Professor Thagard says that the punch line of the joke shifts the system into another stable state distant from the original one.

In an actual brain, concepts are not likely to be represented by a single neuron; it is more likely that population codes (such as semantic pointers) would be used. So an implementation of the above relationships between concepts would be more complicated. Moreover, the model doesn’t explain how the original constraints between concepts are learned. I would guess that implementation details might modify the model somewhat. Coherence doesn’t mean that multiple rational people will come to the same conclusions on issues – even scientists who prize rationality often disagree with each other. Sometimes, even the evidence you will accept depends on a large network of assumptions and beliefs. Which nodes do you include? What weights do you assign to the constraints?

Still, the model is intuitive and makes sense.

You can get a link to the various programs mentioned at http://cogsci.uwaterloo.ca/JavaECHO/jecho.html.   There is also  more info at PaulThagard.com.

Coherence in Thought and Action – Paul Thagard (2000, MIT Press)
Emotional Consciousness: A Neural Model of How Cognitive Appraisal and Somatic Perception Interact to Produce Qualitative Experience – Paul Thagard and Brandon Aubie (2008)
Social Equality: Cognitive Modeling Based on Emotional Coherence Explains Attitude Change – Paul Thagard (2018, Policy Insights from the Behavioral and Brain Sciences, 5(2), 247-256)


Aha Moments, Creative Insight, and the Brain

In “The Eureka Factor – Aha Moments, Creative Insight, and the Brain”, authors John Kounios and Mark Beeman discuss insight – the kind of insight that might occur to you when taking a walk or taking a shower as opposed to trying to force a solution to a problem in your office under a deadline. (One creative inventor that they mention sets up his environment to encourage insights – at night he will sit on his armchair on his porch looking at the stars, with nondescript music in the background to drown out distracting noises.)
MRI experiments have shown that insight really does happen suddenly; it’s not just an illusion. (When it happens, there is a ‘gamma’ burst of activity in a part of the right hemisphere.) While ‘analytical thinking’ is a process that builds systematically to a conclusion, insight doesn’t work that way, though it benefits from the thinker having looked at the problem from all angles.

Here are a few conclusions by the authors:

  1. …perceptual attention is closely linked to conceptual attention. Factors that broaden your attention to your surroundings, such as positive mood, have the same effect on the scope of your thinking. Besides taking in lots of seemingly unrelated things, the diffuse mind also entertains seemingly unrelated ideas.

  2. if you question people, you’ll find that some see meaning everywhere, in events like the Japanese tsunami and in cryptic sayings like those above. They will give you impassioned explanations of the significance of such things. Other people deny any inherent meaning. “Stuff just happens. Live with it.”

    It was found that people who see meaning in so many life events are also people who trust their hunches and their intuition. Intuition is related to creative thought.

  3. Creative people can be odd:
    The book contains a quote by Ed Catmull, president of Walt Disney Animation Studios and Pixar Animation Studios. He said:

    “There’s very high tolerance for eccentricity; there are some people who are very much out there, very creative, to the point where some are strange.” He values that creative eccentricity and is willing to tolerate a lot of the weirdness that often accompanies it. But movies are made by teams of people and not by a single person, so he has to draw a line. “There are a small number of people who are, I would say, socially dysfunctional, very creative,” he said. “We get rid of them.”

So what are the neural underpinnings to the creative – insightful type?

The authors think the creative, insightful type has reduced inhibition.

Inhibition, as a cognitive psychologist thinks of it, regulates emotion, thought, and attention. It’s a basic property of the brain.
…when you purposely ignore something, even briefly, it’s difficult to immediately shift mental gears and pay full attention to it, a phenomenon called “negative priming.” This can sometimes be a minor inconvenience, but it occurs for a reason. When you ignore something, it’s because you deemed it to be unimportant. By inhibiting something that you’ve already labeled as irrelevant, you don’t have to waste time or energy reconsidering it. More generally, inhibition protects you from unimportant, distracting stimuli.

To me (the blogger), it doesn’t make much sense that creative people would be more distractible. Or at least, I would think that creativity is not just a matter of casting a wide net to gather associations of little relevance to the problem at hand. That could be a part of it, of course.

Supporting the reduced-inhibition idea is the fact that insightfuls, in a resting state (when not solving problems), have more right-hemisphere activity and less left-hemisphere activity than normals. The right hemisphere differs from the left in that in many of its association areas, the neurons have larger input fields than do left hemisphere neurons. Specifically, right hemisphere pyramidal neurons have more synapses overall and especially more synapses far from the cell body. This indicates that they have larger input fields than corresponding left hemisphere pyramidal neurons. Because cortical connections are spatially organized, the right hemisphere’s larger input fields collect more differentiated inputs, perhaps requiring a variety of inputs to fire. The left hemisphere’s smaller input fields collect similar inputs, likely causing the neuron to respond best to somewhat redundant inputs.

Even the axons in the right hemisphere are longer, suggesting that more far-flung information is used.

Both hemispheres can work together to solve a problem, so you can have the best of both worlds – a narrow, focused approach, and a more diffuse, creative approach.

If you want to increase your own insights, the authors have various suggestions.

  1. Expansive surroundings will help you to induce the creative state. The sense of psychological distance conveyed by spaciousness not only broadens thought to include remote associations, it also weakens the prevention orientation resulting from a feeling of confinement. Even high ceilings have been shown to broaden attention. Small, windowless offices, low ceilings, and narrow corridors may reduce expenses, but if your goal is flexible, creative thought, then you get what you pay for.

  2. You should interact with diverse individuals, including some (nonthreatening) nonconformists.
  3. You should periodically consider your larger goals and how to accomplish them; merely thinking about this will induce a promotion mind-set. Reserve time for long-range planning. Thinking about the distant future stimulates broad, creative thought.
  4. Cultivate a positive mood…To put a twist on Pasteur’s famous saying, chance favors the happy mind.

So if you are tired of working at your desk, wave “The Eureka Factor” at your boss, and tell him that you need to hike in an alpine meadow with your eccentric friend with the guitar who never graduated high school, and maybe he’ll let you do it!

The Cognitive Neuroscience of Insight – John Kounios and Mark Beeman
The Eureka Factor – John Kounios and Mark Beeman (2015)

The frustrating insula – or why Brain Books can’t match Shakespeare

Often popular books on the brain will tell you that a particular part of the brain is responsible for various human attributes, but no common thread jumps out at you. You learn more about people from reading a good novel than from reading 100 pages of bewildering functions of grey matter.

The Insula (see diagram below) is an example. I’ll list a few tantalizing conclusions from various studies, and if you find a common thread, add a comment and let me know.


According to neuroscientists who study it, the insula is crucial to understanding what it feels like to be human.

They say it is the wellspring of social emotions, things like lust and disgust, pride and humiliation, guilt and atonement. It helps give rise to moral intuition, empathy and the capacity to respond emotionally to music.

So here are a few findings on this part of the brain:

  1. A conservative or left-wing brain? – liberals have higher insula activation:
    Researchers have long wondered if some people can’t help but be an extreme left-winger or right-winger, based on innate biology. To an extent, studies of the brains of self-identified liberals and conservatives have yielded some consistent trends. Two of these trends are that liberals tend to show more activity in the insula and the anterior cingulate cortex. The two regions overlap to an extent in function: the insula deals with cognitive conflict, while the anterior cingulate cortex helps in processing conflicting information. Conservatives, on the other hand, have demonstrated more activity in the amygdala, known as the brain’s “fear center”: “If you see a snake or a picture of a snake, the amygdala will light up.”
  2. Higher insula activation when thinking about risk is associated with criminality. In fact criminals think about risk in an opposite way to law-abiding citizens:
    A study has shown a distinction between how risk is cognitively processed by law-abiding citizens and how that differs from lawbreakers, allowing researchers to better understand the criminal mind. “We have found that criminal behavior is associated with a particular kind of thinking about risk,” said Valerie Reyna, the Lois and Melvin Tukman Professor of Human Development and director of the Cornell University Magnetic Resonance Imaging Facility. “And we have found, through our fMRI capabilities, that there is a correlate in the brain that corresponds to it.” In the study, published recently in the Journal of Experimental Psychology, Reyna and her team took a new approach. They applied fuzzy-trace theory, originally developed by Reyna to help explain memory and reasoning, to examine neural substrates of risk preferences and criminality. They extended ideas about gist (simple meaning) and verbatim (precise risk-reward tradeoffs), both core aspects of the theory, to uncover neural correlates of risk-taking in adults.

    Participants who anonymously self-reported criminal or noncriminal tendencies were offered two choices: $20 guaranteed, or to gamble on a coin flip for double or nothing. Prior research shows that the vast majority of people would choose the $20 – the sure thing. This study found that individuals who are higher in criminal tendencies choose the gamble. Even though they know there is a risk of getting nothing, they delve into verbatim-based decision-making and the details around how $40 is more than $20.

    The same thing happens with losses, but in reverse.

    Given the option to lose $20 or flip a coin and either lose $40 or lose nothing, the majority of people this time would actually choose the gamble because losing nothing is better than losing something. This is the “gist” that determines most people’s preferences.

    Those who have self-reported criminal tendencies do the opposite through a calculating verbatim mindset, taking a sure loss over the gamble.

    “This is different because it is cognitive,” Reyna said. “It tells us that the way people think is different, and that is a very new and kind of revolutionary approach – helping to add to other factors that help explain the criminal brain.”

    As these tasks were being completed, the researchers looked at brain activation through fMRI to see any correlations. They found that criminal behavior was associated with greater activation in temporal and parietal cortices, their junction and insula – brain areas involved in cognitive analysis and reasoning.

    “When participants made reverse-framing choices, which is the opposite of what you and I would do, their brain activation correlated or covaried with the score on the self-reported criminal activity,” said Reyna. “The higher the self-reported criminal behavior, the more activation we saw in the reasoning areas of the brain when they were making these decisions.”

    Noncriminal risk-taking was different: Ordinary risk-taking that did not break the law was associated with emotional reactivity (amygdala) and reward motivation (striatal) areas, she said.

    Not all criminals are psychopaths, but psychopaths show differences as well.
    A study of 80 prisoners used functional MRI technology to determine their responses to a series of scenarios depicting intentional harm or faces expressing pain. It found that psychopaths showed no activity in areas of the brain linked to empathic concern. The participants in the high psychopathy group exhibited significantly less activation in the ventromedial prefrontal cortex, lateral orbitofrontal cortex, amygdala and periaqueductal gray parts of the brain, but more activity in the striatum and the insula when compared to control participants, the study found. The high response in the insula in psychopaths was an unexpected finding, as this region is critically involved in emotion and somatic resonance. Conversely, the diminished response in the ventromedial prefrontal cortex and amygdala is consistent with the affective neuroscience literature on psychopathy. (This latter region is important for monitoring ongoing behavior, estimating consequences and incorporating emotional learning into moral decision-making, and plays a fundamental role in empathic concern and valuing the well-being of others.)

  3. Damaging the insula can cure addiction:
    The recent news about smoking was sensational: some people with damage to a prune-size slab of brain tissue called the insula were able to give up cigarettes instantly.
  4. The insula is responsible for the feeling of disgust:
    Insula activation was only significantly correlated with ratings of disgust, pointing to a specific role of this brain structure in the processing of disgust. This ties in somehow to what I cited before on political leanings. In one study, people of differing political persuasions were shown disgusting images in a brain scanner. In conservatives, the basal ganglia and amygdala and several other regions showed increased activity, while in liberals other regions of the brain increased in activity. Both groups reported similar conscious reactions to the images. The difference in activity patterns was large: the reaction to a single image could predict a person’s political leanings with 95% accuracy (this may be hard to believe, but it is according to Neuroscientist Read Montague, who works at Virginia Tech in Roanoke. It is reported in newscientist.com which in turn cites his research article).

I’ve listed all these items, many very interesting, but at the end of the day, what is going on?





Structural and Functional Cerebral Correlates of Hypnotic Suggestibility – Alexa Huber, Fausta Lui, Davide Duzzi, Giuseppe Pagnoni, Carlo Adolfo Porro


Neural Arithmetic Logic Units – getting backpropagation nets to extrapolate

Backpropagation nets have a problem doing math. You can get them to learn a multiplication table, but when you try to use the net on problems where the answers are higher or lower than the ones used in training, they fail. In theory, they should be able to extrapolate, but in practice, they memorize, instead of learning the principles behind addition, multiplication, division, etc.

A group at Google DeepMind in England solved this problem.
They did this by modifying the typical backprop neuron as follows:

  1. They removed the bias input
  2. They removed the nonlinear activation function
  3. Instead of just using one weight on each incoming connection to the neuron, they use two. Both weights are learned by gradient descent, but a sigmoid function is applied to one, a hyperbolic tangent (tanh) function is applied to the other, and the results are multiplied together. In standard nets, sigmoid or tanh functions are applied to activations, not to weights; here the opposite is true.

Here is the equation for computing the weight matrix. W is the final weight, and Ŵ and M̂ (the variables with the hat symbols) are the learned values that are combined to create that final composite weight:

W = tanh(Ŵ) ⊙ σ(M̂)

where σ is the sigmoid function and ⊙ denotes elementwise multiplication.

So what is the rationale behind all this?

First, let’s look at what a sigmoid function looks like:


And now a hyperbolic tangent function (also known as ‘tanh’):


We see that the sigmoid function ranges (on the Y axis) between 0 and 1. The tanh ranges from -1 to 1. Both functions change quickly when their x-values are fairly close to zero, but that rate of change flattens out the farther they get from that point.

So if you multiply these two functions together, the most the product can be is 1, the least is -1, and there is a bias in the composite weight result: it’s less likely to be fractional, and more likely to be -1, 1, or zero.
Why the bias?
Near x = 0 the derivative is large, so gradient descent takes its biggest steps there and a weight quickly moves away from that region; at the saturation points the derivative is small, so a weight tends to settle there. Thus tanh is biased to learn its saturation points (-1 and 1), and the sigmoid is biased to learn its saturation points (0 and 1). The elementwise product of the two therefore has saturation points at -1, 1, and 0.

So why have a bias? As they explain:

Our first model is the neural accumulator (NAC), which is a special case of a linear (affine) layer whose transformation matrix W consists just of -1’s, 0’s, and 1’s; that is, its outputs are additions or subtractions (rather than arbitrary rescalings) of rows in the input vector. This prevents the layer from changing the scale of the representations of the numbers when mapping the input to the output, meaning that they are consistent throughout the model, no matter how many operations are chained together.

As an example, if you want the neuron to realize it has to add 5 and -7, you don't want those numbers multiplied by fractions; rather, in this case, you prefer weights of 1 and -1. Likewise, the result of this neuron's addition could be fed into another neuron, and again, you don't want it multiplied by a fraction before it is combined with that neuron's other inputs.

This isn’t always true though, one of their experiments was learning to calculate the square root, which required a weight training to the value of 0.5.

On my first read of the paper, I wasn't sure why the net worked, so I asked one of the authors, Andrew Trask, who replied that it works:


  1. because it encodes numbers as real values (instead of as distributed representations)
  2. because the functions it learns over numbers extrapolate inherently (aka… addition/multiplication/division/subtraction) – so learning an attention mechanism over these functions leads to neural nets which extrapolate


The first point is important because many models assume that any particular number is coded by many neurons, each with different weights. In this model, one neuron, without any nonlinear function applied to its result, does math such as addition and subtraction.

It is true that real neurons are limited in the values they can represent. In fact, neurons fire at a constant, fixed amplitude; it's just the frequency of pulses that increases when they get a higher input.

But ignoring that point, the units they have can extrapolate, because they do simple addition and subtraction (point #2).

But wait a minute – what about multiplication and division?

For those operations they make use of a mathematical property of logarithms: log(X * Y) is equal to log(X) + log(Y). So if you take logarithms of values before you feed them into an addition neuron, and then apply the inverse of the log (the exponential) to the result, you have the equivalent of multiplication.

The log is differentiable, so the net can still learn by gradient descent.
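A minimal sketch of that multiplication path, assuming (hypothetically) that the weights have already been learned; the small epsilon guards against taking the log of zero:

```python
import numpy as np

eps = 1e-7  # keeps log() finite when an input is zero

def mult_path(x, W):
    # Addition in log space is multiplication in linear space
    # (and a weight of -1 gives division): exp(W . log|x|)
    return np.exp(W @ np.log(np.abs(x) + eps))

x = np.array([3.0, 4.0])
print(mult_path(x, np.array([1.0, 1.0])))   # ~12.0  (3 * 4)
print(mult_path(x, np.array([1.0, -1.0])))  # ~0.75  (3 / 4)
```

The same weights-of-±1 bias that made addition and subtraction exact now makes multiplication and division exact in log space.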

So they now need to combine the addition/subtraction neurons with the multiplication/division neurons, and this diagram shows their method:



This fairly simple but clever idea is a breakthrough:

Experiments show that NALU-enhanced neural networks can learn to track time, perform arithmetic over images of numbers, translate numerical language into real-valued scalars, execute computer code, and count objects in images. In contrast to conventional architectures, we obtain substantially better generalization both inside and outside of the range of numerical values encountered during training, often extrapolating orders of magnitude beyond trained numerical ranges.

Neural Arithmetic Logic Units – Andrew Trask, Felix Hill, Scott Reed, Jack Rae, Chris Dyer, Phil Blunsom – Google DeepMind

Neurithmic System’s radical new way of thinking about memory and their program (Sparsey) that implements it

Professor Rod Rinkus of Neurithmic Systems came up with a net (he calls it Sparsey) that is about memory – storing memories and retrieving them. No matter how many memories are already stored, the time to store a new memory, or to retrieve an old one, stays the same. There are some very promising aspects of his idea, and I will explain the general idea below. If you want to delve further, his actual papers are at his website (sparsey.com).

Suppose you want to store a pattern, perhaps a number. You could store it as follows:


Here we will assume that only one bit can be ON at a time. With this constraint, we could only present 5 different numbers to the neural net, and it might learn to associate each with a different position of the green dot.

This is called a ‘localist’ representation.   One disadvantage of associating patterns this way is that similarity is lost.   The number ‘1’ might be associated with a green dot at the second position, or it might be associated with a green dot at the fifth position.   Also, you can’t store many patterns this way unless you have many neurons.

A more compact way to store data is with a positional number system. For instance, in everyday math we use base 10 numbers (digits 0 through 9), where a new digit is added to represent 10, then 100, then 1000, etc. If we only want to use ON and OFF, we can limit our digits to zero and one. This gives us a base 2 number system. In that case,

the number zero would be:


The number one would be:


The number two would require an additional digit just like 10 in base 10:


Three would be:


And four would be:


We have already represented 5 numbers, this time with only 3 places needed, and we can represent more: for instance, if all three dots are green (1,1,1), then we have the number 7 in this binary system.

This system is compact, but still, similar numbers are not coded similarly. Suppose we measured similarity by overlap: the number of dots in the same position that have the green color. We would see that the number zero has no overlap with the number one, though they are close numerically. And four has no overlap with three, though three does have one position that overlaps with two.
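A quick way to check those overlap claims, using the three-bit codes from the text:

```python
# Binary codes for the numbers 0..4, three bits each (as in the text).
codes = {0: (0, 0, 0), 1: (0, 0, 1), 2: (0, 1, 0), 3: (0, 1, 1), 4: (1, 0, 0)}

def overlap(a, b):
    # Count positions where both codes have an ON (green) bit.
    return sum(x == 1 and y == 1 for x, y in zip(codes[a], codes[b]))

print(overlap(0, 1))  # 0 -- zero and one share no ON bits
print(overlap(3, 2))  # 1 -- three and two share the middle bit
print(overlap(4, 3))  # 0 -- four and three share nothing
```

Numerically adjacent numbers can have zero overlap, which is exactly the failure of similarity preservation the text describes.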

An ideal memory would code similar items similarly.   The more different the items were, the more different their representations should be.

In the brain, a sofa and a chair should be represented more similarly than a sofa and a mailbox, for instance.   A robin should have a representation closer to a parrot than to a giraffe.

If this can be accomplished, there are several advantages, which I will list after showing one of Gerard Rinkus’s storage units below.


The above is a representation of what he calls a MAC.   Each MAC is made up of several “Competitive Modules”.   In the brain, the Competitive Modules (which he abbreviates as CM) would correspond to cortical Minicolumns, and the MAC would correspond to a cortical MacroColumn.   In this illustration, we are looking at one MAC with three CMs in it.   Each CM has internal competition where only one neuron (in this case each CM has 3 neurons) can win the competition and turn on.   The others lose and turn off.

So what is the advantage of this? First of all, since each CM can be in 3 separate states, there are in total 3 * 3 * 3 = 27 patterns that can be represented. This is more compact than a totally localist method (where, of the 9 neurons, only one can be ON at a time), though not as compact as the base 2 example we gave.

A Sparse-Distributed Representation (SDR) is a representation where only a small fraction of neurons are ON. In the above MAC, only a third of the neurons are ON, so it qualifies. We can interpret the fraction of a feature's SDR code that overlaps with the stored SDR of a memory (hypothesis) as the probability/likelihood of that feature (hypothesis).

Using Rinkus’s CMs and MACs, we can introduce similarity into our representations that reflects similarity in the world.

Take a look at these 2 MACs


Notice that they do overlap in 2 out of 3 of the CMs.   We could say that their similarity is 2/3.   If they were identical, their similarity would be 3/3.   And so forth.
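This overlap measure is easy to sketch in code. Here a MAC code is just the tuple of winning-neuron indices, one per CM (the 'cat' and 'dog' codes below are made-up values for illustration):

```python
import itertools

Q, K = 3, 3  # 3 CMs per MAC, 3 neurons per CM (as in the illustration)

# Every possible MAC code: one winner index per CM.
all_codes = list(itertools.product(range(K), repeat=Q))
print(len(all_codes))  # 27 possible codes, versus 9 for a localist scheme

def similarity(code_a, code_b):
    # Fraction of CMs whose winning neuron is the same in both codes.
    return sum(a == b for a, b in zip(code_a, code_b)) / Q

cat = (0, 2, 1)  # hypothetical winners for 'cat'
dog = (0, 2, 0)  # hypothetical winners for 'dog'
print(similarity(cat, dog))  # 2/3 -- they agree in 2 of the 3 CMs
```

The same function also gives the 27-code capacity mentioned earlier, and the graded similarity (0/3, 1/3, 2/3, 3/3) that a localist code lacks.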

What are the advantages of representations that reflect similarity?

Well if you have a net (such as Sparsey) that automatically represents similar inputs in a similar way, then you automatically learn hierarchies.   For example, by looking at the SDR for ‘cat’ and ‘dog’, versus the SDRs for ‘cat’ and ‘fish’, you would see that ‘cat’ and ‘dog’ are more similar.

Another interesting advantage is this.   Suppose the MAC on the left represents a cat, and the MAC  on the right represents a dog.    Now you are presented with a noisy input that doesn’t exactly match cat or dog, and gets a representation such as:


This MAC representation overlaps the MACs for dog and for cat equally.   We could say that the probability that what the net saw was a cat is equal to the probability that it saw a dog.

So the fact that similar inputs yield similar representations means that we can look at a SDR as a probability distribution over several possibilities.   This MAC overlaps the representation for dog and for cat by two CMs, but perhaps this MAC might overlap with just one CM a representation for ‘mouse’.

The next figure is from an article by Prof. Rinkus. Unlike my depictions, his MACs have a hexagonal shape and each CM is a cluster of little circles (neurons).



A single MAC can learn a sequence of patterns over time by having horizontal connections from every neuron in the MAC, with a time delay, to every neuron (including itself) in the same MAC. Once it has learned these, if an input pattern activates the first SDR, then in the next time step that SDR leads to the SDR representing the second pattern it was taught, which in turn leads to the next. Now assume the MAC has many CMs (maybe 70), and each CM has maybe 20 neurons. 20 to the power of 70 is a huge number, and with this large capacity you can store many sequences without 'crosstalk' (interference). Also, when presented with the start of an ambiguous sequence, the net can keep multiple possible endings of that sequence in memory at the same time. (This reminds me of quantum computing, where multiple possibilities are tried at the same time.)

Suppose Sparsey is trained on single words such as:

  1. THEN
  2. THERE
  3. THAT
  4. SAILBOAT

And so forth.

If the first four words it was trained on are the words above, and we now want to retrieve a word, we present 'TH' (the letters are presented singly: 'T' at time t1, 'H' at time t2). At that point there are three possibilities for the stored sequence we are attempting to retrieve (THEN, THERE and THAT). If the next letter that comes in is 'E', there are only two possibilities (THEN and THERE). If the next letter is 'R', we are left with just the sequence of letters in the word THERE. When we use the principle that similar inputs yield similar SDRs, and we also insist that when a pattern is learned every ON neuron in a MAC increases weights on the same connections as every other ON neuron, then at any time, all learned sequences stored in memory that match the input so far remain possible, until the ambiguity is resolved by the next input.
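The narrowing of possibilities can be illustrated with a toy sketch. This is ordinary prefix matching, not Sparsey's actual mechanism, which achieves the same effect through overlapping SDRs:

```python
# Toy illustration: how presenting a sequence one letter at a time
# narrows the set of stored sequences that still match.
vocabulary = ["THEN", "THERE", "THAT", "SAILBOAT"]

def candidates_after(prefix):
    # All stored words consistent with the letters seen so far.
    return [w for w in vocabulary if w.startswith(prefix)]

print(candidates_after("TH"))    # ['THEN', 'THERE', 'THAT']
print(candidates_after("THE"))   # ['THEN', 'THERE']
print(candidates_after("THER"))  # ['THERE']
```

In Sparsey, the superposition of the still-possible words is carried implicitly by which neurons are active; here we simply enumerate the candidate set at each step.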

Think about the letter 'A' in the above 4 words. We see that 'A' occurs in THAT (one time) and in SAILBOAT (in two places). There are 3 instances of 'A', and they cannot be represented in exactly the same way (if they were, there would be no clue as to what comes next). 'A' in the context of THAT does not have the same exact representation as the first 'A' in SAILBOAT, and neither has the same exact SDR as the second 'A' in SAILBOAT. Nonetheless, they will have representations that are more similar to each other than to the letter 'B', for instance.

Remember that any sequence is represented as a series of time steps; what varies from step to step is the position in the sequence, not the sequence itself. Think of your own brain. Your past and your future are all compressed into the present moment. The past can be retrieved, and the future can be predicted, but at any moment all you have is the present: a snapshot of neural firings in your brain. The same is true in Sparsey of a sequence such as SAILBOAT. When you reach the first 'A' of SAILBOAT, the active SDR has all the information needed to complete that word, assuming that only the above 4 words were learned. There is no ambiguity. But that is only true because the pattern for this 'A' is slightly different than for the other 'A's (such as the 'A' in THAT). They don't overlap completely.

So how does Sparsey achieve the property of similar inputs giving rise to similar memories?

First we need to know that each neuron in each CM of a particular MAC has exactly the same inputs. It may not have the same weights applied to those inputs, but it has the same inputs. The inputs come from a lower level, which might be a picture in pixels, or, in a multilevel net, another more abstract level. Initially, all weights are zero.


Sparsey’s core algorithm is called the Code Selection Algorithm (CSA).   We’ll say that in every CM there are K neurons.   In each MAC (there can be several per level in a multilevel net) there are Q CMs.

CSA Step 1 computes the input sums for all Q×K cells comprising the coding field.  Specifically, for each cell, a separate sum is computed for each of its major afferent synaptic projections.

The cells also have horizontal inputs from cells within some maximum perimeter around the MAC, and may have signals coming down on connections from a layer above them. But we'll focus on just the inputs coming from below. The following is a very simplified version of what happens:

As in typical neural nets, each neuron in a MAC has an activation 'V' equal to the sum, over its connections, of the weight on a connection times the signal coming over that connection.

Then these sums are normalized so that none exceeds one and none is less than zero, but the neurons retain their relative magnitudes, or 'V' values.

Now find the max V in each CM and tentatively pick the neuron with that value to be the ON neuron in that CM.

Finally, a measure called G is computed as the average max-V across the Q CMs.  In the remaining CSA steps, G is used, in each CM, to transform the V distribution over the K cells into a final probability distribution from which a winner is picked.  G’s influence on the distributions can be summarized as follows.

  a) When high global familiarity is detected (G is close to 1), those distributions are exaggerated to bias the choice in favor of cells that have high input summations.
  b) When low global familiarity is detected (G is close to 0), those distributions are flattened so as to reduce bias due to local familiarity.

G does this indirectly, by modifying a ‘sigmoid’ curve that is applied to each neuron’s output.

The lower level in the next picture has a sigmoid curve (the red curve to the right) of normal height. The upper level has a sigmoid curve that has been flattened. We can see that in the lower level's sigmoid function, Y-axis values are farther apart (at least in the middle of the 'S') than in the upper level's. The lower level here, we assume, had a larger G than the upper level did, so the CSA calculates a taller sigmoid to apply to the neurons in that level. If a sigmoid is flattened, the probability of the most likely neuron is brought closer to that of the second and third most likely, so there is a greater chance that a neuron other than the one with the highest weighted input summation will fire and become part of the memory. Since low G means low confidence (or low familiarity), we do want the new SDR to have some differences from whatever stored SDR the collection of V's seems closest to. Having probabilities that are close together makes such differences more likely.

Suppose you see a prototypical cat that is just like the pet cat owned by your neighbor. You already have a memory that matches very closely (your G is high). Now suppose you see an exotic breed of cat that you've never encountered. It matches all stored traces of cats less well, and therefore the memory that the CSA creates for it should be somewhat different. So even though the V's may approximate a cat (or an intersection of cats) that you've seen before, applying the flattened sigmoid and then tossing the dice on which neuron wins in each CM will lead to at least some CMs with different neurons firing than in the prototypical cat representation. The flatter the sigmoid, the more likely a CM is to select a neuron other than the favored one to be ON.
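Here is a highly simplified sketch of that winner-selection step. The sizes, the beta scale, and the way G sharpens each CM's distribution are my assumptions; the real CSA modulates an explicit sigmoid rather than a softmax temperature:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, K = 4, 5  # hypothetical sizes: 4 CMs per MAC, 5 cells per CM

def csa_pick(V, beta=10.0):
    # V is a Q x K matrix of normalized input sums in [0, 1].
    G = V.max(axis=1).mean()  # global familiarity: average max-V across CMs
    # High G -> peaked per-CM distributions (favor the max-V cell);
    # low G -> flat distributions (winners closer to random).
    logits = beta * G * V
    winners = []
    for cm in range(Q):
        p = np.exp(logits[cm] - logits[cm].max())
        p /= p.sum()  # a probability distribution within the CM
        winners.append(int(rng.choice(K, p=p)))
    return G, winners

# A familiar input: one cell in each CM has a much higher normalized sum.
V = np.tile(np.array([0.05, 0.05, 0.95, 0.05, 0.05]), (Q, 1))
G, code = csa_pick(V)
print(G)     # close to 1, so the winners are almost surely the max-V cells
print(code)
```

With an unfamiliar input (all V's mediocre), G drops, the distributions flatten, and the drawn code diverges from any stored SDR, which is exactly the behavior the text describes.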

The connections from the inputs in the receptive field of the MAC (in the lower level) will strengthen to those neurons finally chosen for the SDR in the level above it. Synapses are basically binary, though their strengths can decay, and neuron activations are binary too.


Any finite net that stores many memories can run into a problem of interference, or "cross-talk". The problem is that there are so many learned links that similar patterns differing by very few neurons can be confused with each other. You can also get patterns that are hybrids of others and never were actually encountered in real life. The CSA actually freezes the number of SDRs a MAC can learn after a critical period, to attempt to avoid this problem. In a multilevel net this is not necessarily a limitation.

I sent a few questions about human mental abilities and weaknesses to Professor Rinkus, and he had interesting replies.

I asked about memories that are false, or partly false, and he said this:

Let’s consider an episodic memory example, my 10th birthday, with different features, where it was, who was there, etc.  That episodic memory as a whole is spread out across many of my macro-columns (“macs”), across all sensory modalities.  But those macs have been involved in another 50 more years of other episodic memories as well.  In general, the rate at which new SDR codes, and thus the rate at which crosstalk accrues, may differ between them.  So, say one mac M1 where a visual image of one of my friends at the party, John, is stored has had many more images of John and other people stored over the years, and is quite full (specifically, ‘quite full’ means that so many SDRs have been stored that the average Hamming distance between all those stored codes has gotten low).  But suppose another mac, M2, where a memory trace of some other feature of the party, say, “number of presents I got”, say 10, was stored ended up having far fewer SDRs stored in it over the years, and so, much less crosstalk.  (After all, the number of instances where I saw a person is vastly greater than the number of instances where I got presents, so the hypothetical example has some plausibility).  So now, when I try to remember the party, which ideally would mean reactivating the entire original memory trace, across all the macs involved, as accurately as possible, including with their correct temporal orders of activation, the chance of activating the wrong SDR in M1 (e.g., remembering image of other friend, Bill, instead of John), is higher than activating the wrong trace in M2…so I remember (Bill, 10) instead of (John, 10).   The overall trace I remember is then a mix of things that actually happened in different instances, e.g., confabulation.

He also said this:

Whenever you recognize any new input as familiar, reactivation of the original trace must be happening.  So, the act of creating new memories involves reactivation of old memories. But reactivating old memory traces becomes increasingly subject to errors due to increasing crosstalk.  So, if my macs are already pretty full, then as I create brand new memory traces, they could include components that are confabulations…i.e., the memories are wrong from inception.

So Professor Rinkus is saying that a false memory can be wrong not only due to an oversupply of similar memories that affects the retrieval process, but can be wrong even at the time it was stored!

I would add that some memories are false because you don’t remember the source.   If you are told at one point that as a child, you were lost in a mall, even if that’s not true, years later you may have a memory that you were, and you may even fill in details of how it happened and how you felt.

Then I asked this question:

According to Wikipedia: "Eidetic memory (sometimes called photographic memory) is an ability to vividly recall images from memory after only a few instances of exposure, with high precision for a brief time after exposure, without using a mnemonic device." In your theory it would seem that everyone should have this memory, since every experience leaves a trace. Why then, do only a few people have this ability?

I include a part of his answer below:

My general answer is that when we are all infants/young and we have not stored much information (in the form of SDRs) in the macs comprising our cortex, and so the amount of crosstalk interference between memories (SDR codes, chains of SDRs, hierarchies of chains of SDRs) is low, we all have very good episodic memory, perhaps approaching eidetic to varying degrees and in various circumstances.  But as we accumulate experience, storing ever more SDRs into our macs, the level of crosstalk increases, and increasing mistakes (confabulations) are made.  From another point of view, since these confabulations are generally semantically reasonable, we can say that as we age, our growing semantic memory, i.e., knowledge of the similarity structure of the world, gradually becomes more dominant in determining our responses/behavior (we accumulate wisdom)….  I think those who retain extreme eidetic ability into their later years, and perhaps autistics, may have a brain difference  that makes the sigmoid stay much flatter than for normals, i.e., the sigmoid’s dependence on G is somehow muted.

His speculation makes sense, because if the sigmoid is very flat, then new SDRs stored for new patterns will be less likely to overlap much with existing SDRs. Every cat you encounter that is slightly different from an old cat will have its own representation.

If you are interested in more details of the model (I’ve left out many), take a look at Professor Rinkus’s website (sparsey.com).

You can obtain both of the following papers from the publications tab of sparsey.com:
A Radically New Theory of how the Brain Represents and Computes with Probabilities – (2017)
Sparsey™: event recognition via deep hierarchical sparse distributed codes – (2014)