Randall C. O’Reilly, Dean R. Wyatte, and John Rohrlich at the Department of Psychology and Neuroscience of the University of Colorado Boulder have come up with a new model of how the brain learns. As they put it in their article titled “Deep Predictive Learning”:
where does our knowledge come from? Phenomenologically, it appears to magically emerge after several months of [an infant] … gaping at the world passing by — what is the magic recipe for extracting high-level knowledge from an ongoing stream of perceptual experience?
Their model has several basic elements.
There is an ‘error’ signal that reflects the discrepancy between what the brain expects to happen and what is actually experienced.
That idea is not new. In fact, there are many hierarchical models (known as Hierarchical Generative Models) in which ‘predictions’ flow down the hierarchy while sensory data flows up. (The higher levels of the hierarchy generally represent more abstract features of the input, so if a low level represents edges between dark and light, a higher level might represent a shape such as a cylinder.) The two flows subtract at a group of error neurons at each level. For example, if level 1 is the lowest level, it might receive sensory input from the eyes, but it also receives a prediction from the next higher level, level 2. The two signals subtract at a specialized group of neurons in level 1 called ‘error’ neurons, and these then pass the error signal (the mismatch between predictions and actual sensory input) up to level 2. (In this kind of model, level 2 also gets the processed sensory information passed up from level 1.) Level 2 corrects its connections based on a learning algorithm, using the error information. By this process level 2 learns to represent the level below it (level 1).
The same thing happens between level 2 and level 3. Here level 2 passes up abstracted sensory information, level 3 passes down its predictions, and error neurons on level 2 compute the mismatch between the two. The error neurons then pass the error up to level 3, which uses it to refine its predictions for next time. Eventually level 3 learns to represent level 2 in an abstract way.
It is typical to train a model like this from the bottom up, so first level 2 learns to represent level 1, then level 3 learns to represent level 2, and so on.
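The error-driven loop described above can be sketched in a few lines of Python. This is only an illustration of the generic hierarchical idea, not the Boulder model: the function names, the fixed level-2 code, the sample values, and the simple delta-rule update are all assumptions made for the example.

```python
def predict(weights, code):
    """Top-down prediction of level-1 activity from a level-2 code."""
    return [sum(w * c for w, c in zip(row, code)) for row in weights]

def error_signal(sensory, prediction):
    """Error neurons: sensory input minus top-down prediction."""
    return [s - p for s, p in zip(sensory, prediction)]

def update(weights, code, error, lr=0.1):
    """Delta rule (an assumed learning rule): shrink the error a little."""
    return [[w + lr * e * c for w, c in zip(row, code)]
            for row, e in zip(weights, error)]

def total_error(weights, sensory, code):
    """Sum of squared mismatches between input and prediction."""
    return sum(e * e for e in error_signal(sensory, predict(weights, code)))

sensory = [1.0, 0.0, 0.5]                 # level-1 activity (made-up values)
code = [1.0, 0.5]                         # level-2 representation, held fixed
weights = [[0.0, 0.0] for _ in sensory]   # level 2 starts with no knowledge

initial = total_error(weights, sensory, code)
for _ in range(200):
    err = error_signal(sensory, predict(weights, code))
    weights = update(weights, code, err)
final = total_error(weights, sensory, code)
print(initial, final)
```

With each pass the mismatch shrinks, which is the sense in which level 2 ‘learns to represent’ level 1. Stacking the same loop between level 2 and level 3 would give the bottom-up training order mentioned above.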
The Boulder group see things differently, however. In their model, the various successive levels in the vision hierarchy (V1, V2, V3, etc.) all pass sensory information (or abstractions of sensory information) to an area of the thalamus called the ‘pulvinar’. The pulvinar is like a blackboard in a classroom, where each student has reserved his own small part of the blackboard to make a note or draw a picture. Each student can see the entire blackboard but can only modify his own part of it. The analogy to a classroom doesn’t work too well though, because in their model the blackboard is showing error – or mismatch – between sensory information and predictions of what that information should be. For instance, one section of the blackboard might be comparing the abstracted sensory information from level V2 with the prediction from level V2.
An interesting aspect of this model is that a higher level in the visual hierarchy gets error information from all the levels below it, not just from the layer immediately below it. It sends both predictions and actual information to just one part of the pulvinar but gets back error signals from all parts of the pulvinar.
In this model, the highest-level abstract information is also sent via the pulvinar to the lowest level.
The comparison between sensory signals and predicted signals uses a time difference. The past is used to predict the future. For instance, in a 100-millisecond interval, the first 75 milliseconds are used to send a prediction signal to the pulvinar neurons, and the remaining 25 milliseconds are used to send the current (abstracted) sensory information to those same neurons. The first 75 milliseconds alter the properties of those neurons (perhaps their thresholds adapt), and the input of the next 25 milliseconds is decremented by this new threshold. The resulting neural firing, which reflects the difference between the predicted and the sensory signal, is then fed back from the pulvinar toward the various visual levels. The 75-millisecond prediction signal is called the ‘minus’ phase, and the subsequent ‘ground truth’ signal is called the ‘plus’ phase. (Minus comes before plus: predictions come before the ‘ground truth’ signals that they are then compared with.) The model thus offers an explanation for the brain’s ‘alpha’ rhythm, which repeats 10 times a second (so each cycle is 100 milliseconds long). Both the deep layers and the pulvinar show this rhythm. The superficial layers do not show the rhythm.
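As a rough illustration of this timing, the following Python sketch splits a 100-millisecond cycle into the two phases and treats a single pulvinar unit as a threshold that adapts during the minus phase. The constant and function names are invented for the example, and the threshold mechanism is the ‘perhaps’ from the paragraph above, not a confirmed detail of the model.

```python
CYCLE_MS = 100                  # one alpha cycle
MINUS_MS = 75                   # prediction ('minus') phase
PLUS_MS = CYCLE_MS - MINUS_MS   # ground-truth ('plus') phase

def phase_at(t_ms):
    """Return which phase a time point (in milliseconds) falls in."""
    return "minus" if t_ms % CYCLE_MS < MINUS_MS else "plus"

def run_cycle(prediction, ground_truth):
    """One cycle for a single hypothetical pulvinar unit.

    Minus phase: the unit's threshold adapts to the predicted drive.
    Plus phase: the actual drive is decremented by that threshold,
    so the output is the mismatch between the two.
    """
    threshold = prediction
    return ground_truth - threshold

cycles_per_second = 1000 // CYCLE_MS
print(phase_at(40), phase_at(80), cycles_per_second)
```

A perfect prediction gives `run_cycle(0.5, 0.5) == 0.0`, that is, no error to feed back.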
In an infant, some areas of the brain work before others.
The infant may see a blob moving, without the details, but this is enough to provide an error mismatch as to ‘where’ something is, and the ‘where’ error corrects the spatial movement predictions to the point that now the brain can concentrate on the ‘what’. The authors write:
Unlike the formation of invariant object identity abstractions (in the What pathway), spatial location (in retinotopic coordinates at least) can be trivially abstracted by simply aggregating across different feature detectors at a given retinotopic location, resulting in an undifferentiated spatial blob. These spatial blob representations can drive high level, central spatial pathways that can learn to predict where a given blob will move next, based on prior history, basic visual motion filters, and efferent copy inputs…
This is a division of labor, where one problem must be solved before another is addressed. (Blogger aside: perhaps this has implications for backpropagation nets in general. You could have a few hidden nodes involved in learning one aspect of a scene, then freeze their weights so they don’t change, and then add more hidden nodes to learn other aspects of the scene.)
The authors also believe that the need to address problems in order explains a finding: the lowest level’s error signal goes not just to the next higher level but also to the highest levels. They write:
We have consistently found that our model depends critically on all areas at all levels receiving the main predictive error signal generated by the V1 layer 5IB driver inputs to the pulvinar in the plus phase. This was initially quite surprising at a computational level, as it goes strongly against the classic hierarchical organization of visual processing, where higher areas form representations on top of foundations built in lower areas — how can high-level abstractions be shaped by this very lowest level of error signals? We now understand that the overall learning dynamic in our model is analogous to multiple regression, where each pathway in the model learns by absorbing a component of the overall error signal, such that the residual is then available to drive learning in another pathway. Thus, each factor in this regression benefits by directly receiving the overall error signal…
(Loosely, the analogy is to fitting a regression one component at a time: the dominant component of the data is accounted for first, and secondary and tertiary components are then fit to the residual that remains.)
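The regression analogy can be made concrete with a toy example. In the sketch below, two hypothetical ‘pathways’ (named `w_where` and `w_what` purely for illustration) each own one weight, and both are updated from the same overall error signal. The data, the learning rate, and the plain gradient-descent rule are assumptions made for the example and are not taken from the paper.

```python
# Each sample: (where_feature, what_feature, observed_signal).
# The made-up signal is 3 * where + 1 * what.
samples = [(1.0, 0.0, 3.0), (0.0, 1.0, 1.0), (1.0, 1.0, 4.0), (2.0, 1.0, 7.0)]

w_where, w_what = 0.0, 0.0   # one weight per hypothetical pathway
lr = 0.05                    # assumed learning rate
history = []                 # squared error per epoch

for epoch in range(2000):
    grad_where = grad_what = sq_err = 0.0
    for a, b, y in samples:
        # A single overall error signal, shared by both pathways.
        err = y - (w_where * a + w_what * b)
        grad_where += err * a
        grad_what += err * b
        sq_err += err * err
    w_where += lr * grad_where
    w_what += lr * grad_what
    history.append(sq_err)

print(round(w_where, 3), round(w_what, 3))
```

Because each pathway sees the error left over after the other pathway’s contribution, each ends up absorbing its own component of the signal, and the remaining error falls toward zero.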
So how does the mechanism work? The cortex has six layers. Layer 1 is closest to the skull, and layer 6 is deepest. We call layers 2, 3, and 4 ‘superficial’ and layers 5 and 6 ‘deep’.
In the deep predictive model the superficial layers reflect the current state of the environment while the deep layers generate predictions. A region such as V1, which is the lowest in the visual hierarchy, has six layers, and a region such as V3, which is higher, has the same organization of six layers. In the picture below, you can see how a lower level (the bottom square with the 3 colors) interacts with a higher level (the top square). In both squares, blue stands for ‘superficial layers’ and pink stands for ‘deep layers’. The green layer is also considered part of the superficial layers, but it is shown separately because it receives inputs.
You can see that the feedforward signals go from the superficial layers of the lower level to layer 4 in the upper level. But there are also feedback connections. The superficial layers of the top level project down and feed into both the superficial and the deep layers of the lower level. The deep layers of the top level also send feedback to these two areas. Overall, most feedforward projections originate from superficial layers of lower areas, while deep layers predominantly contribute to feedback.
For the deep levels to generate predictions, they must temporarily not be updated by outside information. If you are playing the game “pin the tail on the donkey”, you wear a blindfold so that you won’t see where the donkey is pasted to the wall. You can’t make an honest prediction if you peek under the blindfold.
The authors put it like this:
Well-established patterns of neocortical connectivity combine with phasic burst firing properties of a subset of deep-layer neurons to effectively shield the deep layers from direct knowledge of the current state, creating the opportunity to generate a prediction. Metaphorically, the deep layers of the model are briefly closing their “eyes” so that they can have the challenge of predicting what they will “see” next. This phasic disconnection from the current state is essential for predictive learning (even though it willfully hides information from a large population of neurons, which may seem counter-intuitive) …
Suppose we just focus on region V2 and the region below it, V1. The deep layer of V2, layer 6, which has been shielded for 75 milliseconds from any real-world corrections, sends its predictions to the pulvinar. Shortly after, for a duration of 25 milliseconds, V1’s own deep layer 5 (not 6) sends actual real values to the pulvinar via 5IB neurons. At the pulvinar the two projections impinge on error neurons which calculate a mismatch and send that mismatch back via projections not only to V2, but also to higher layers, which then use an algorithm (equivalent to backpropagation) to learn from the error.
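The routing in this step can be sketched as follows. The three-area list, the function names, and the sample activity values are hypothetical, chosen only to show a mismatch being computed once and then delivered to V2 and to the areas above it. The learning step that uses the error is left out.

```python
HIERARCHY = ["V1", "V2", "V3"]   # assumed ordering, lowest area first

def pulvinar_mismatch(prediction, ground_truth):
    """Error neurons: plus-phase input minus minus-phase prediction."""
    return [g - p for p, g in zip(prediction, ground_truth)]

def broadcast(error, source_area, hierarchy=HIERARCHY):
    """Deliver the error from source_area's pulvinar region to all higher areas."""
    idx = hierarchy.index(source_area)
    return {area: list(error) for area in hierarchy[idx + 1:]}

v2_prediction = [0.2, 0.9, 0.4]  # from V2 layer 6 during the minus phase
v1_actual = [0.0, 1.0, 0.5]      # from V1 layer 5IB during the plus phase

error = pulvinar_mismatch(v2_prediction, v1_actual)
received = broadcast(error, "V1")
print(sorted(received))
```

Here both V2 and V3 receive the same V1-level error, while V1 itself is not a recipient in this simplified picture.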
(It was long thought that backpropagation was not realistic in the brain, but if you consider that many connections in the cortex are bidirectional, you can see that the ‘error gradients’ of backpropagation do have a pathway to pass backwards, in the opposite direction to the feedforward inputs. The Colorado group has an online book where they go into detail on this: https://grey.colorado.edu/CompCogNeuro/index.php/CCNBook/Main)
Recall that V2’s deep layer has been shielded from the real world for 75 milliseconds. Now it is time to briefly ‘unshield’ it, and that is done by having V2’s superficial layer update the content of the V2 deep layer.
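The shielding and ‘unshielding’ of the deep layer can be pictured as a gated copy. In this minimal sketch the class name, the scalar state, and the hard 75/25 split are simplifying assumptions. In the model the gating comes from the burst firing of 5IB neurons, not from a clock check.

```python
class Area:
    """A cortical area with a superficial state and a shielded deep state."""

    def __init__(self, state):
        self.superficial = state   # tracks the current input
        self.deep = state          # snapshot the predictions are built from

    def step(self, new_input, t_ms):
        # Superficial layers are never shielded, so they always update.
        self.superficial = new_input
        # Only in the last 25 ms of each 100 ms cycle is the deep
        # layer refreshed from the superficial layer.
        if t_ms % 100 >= 75:
            self.deep = self.superficial

v2 = Area(state=0)
v2.step(new_input=1, t_ms=10)    # minus phase: deep layer stays at 0
v2.step(new_input=2, t_ms=80)    # plus phase: deep layer is updated to 2
v2.step(new_input=3, t_ms=110)   # next minus phase: deep layer holds 2
print(v2.superficial, v2.deep)
```

For most of each cycle the deep state lags behind the superficial state, which is what gives it something to predict.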
In the next image, the ‘prediction’ or ‘minus’ phase is happening first. If you start at the middle row of boxes (which represent the pulvinar), the pulvinar sends an error, obtained by comparing its minus and plus phases, to both the superficial and deep layers (in this case of V2). In the last 25 milliseconds, the 5IB neurons (the diagonal line going down to the right from the superficial layers of V2) update the deep layer of V2. These same 5IB cells drive a plus phase in a higher area of the pulvinar. (At the bottom of this diagram, you can see V1 5IB neurons sending plus-phase information to the area of the pulvinar that will in turn send error information to V2.)
The authors believe that during the next 75 milliseconds, the deep layer is remembering its inputs by a reverberating signal. The superficial layers, which are not shielded from the environment, are attempting to settle down as a ‘Hopfield Network’ might, trying to satisfy all the conflicting constraints of information that arrives from the senses as well as the information that arrives from higher-level superficial layers. An example of this behavior by the superficial layers is the ‘Dalmatian illusion’: it’s hard to figure out what the picture shows unless someone tells you it’s a Dalmatian, in which case top-down information combines with the information you see to settle on the correct interpretation.
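For readers unfamiliar with the comparison, here is a minimal Hopfield-style network in Python. It is a textbook toy, not the superficial-layer dynamics of the model, and the stored pattern, network size, and update rule are chosen only for illustration. A stored pattern stands in for prior knowledge (‘it’s a Dalmatian’), and a noisy input settles onto it.

```python
def train(patterns):
    """Hebbian weights for a list of +1/-1 patterns (no self-connections)."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def settle(w, state, max_sweeps=10):
    """Update units one at a time until no unit changes."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            total = sum(w[i][j] * state[j] for j in range(len(state)))
            new = 1 if total >= 0 else -1
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state

stored = [1, -1, 1, -1, 1, 1, -1, -1]   # the 'known' interpretation
noisy = [1, -1, -1, -1, 1, 1, -1, 1]    # same pattern with two units flipped
weights = train([stored])
settled = settle(weights, noisy)
print(settled == stored)
```

Each unit simply sides with the weighted majority of the others, so the ambiguous input is pulled toward the interpretation that satisfies the most constraints.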
We just said that the deep layer has to be isolated from the outside world, part of the time. In the diagram with the 3 colors above, we see that the ‘superficial’ layer of the low level, which represents ‘current world’ information, does not feed into the ‘deep’ layer of the level above it. This makes sense, because the ‘deep’ level is making the prediction, and we don’t want that ‘prediction’ to be short-circuited by real-world information.
It might seem contradictory though that there is a feedback from the superficial layer of the higher level to the deep layer of the lower level. After all, that superficial layer is also carrying ‘current real world’ information. The authors explain that this layer is more abstract, so it doesn’t carry real-world details that have to be predicted at the lower level.
There is a paradox in this model:
To predict that in the next instant, a cat will move its paw toward a ball of string, you need to feed high level information about the shape of a cat and the properties of a cat. If you are going to learn at the lowest level, V1, what low level features to expect next when seeing part of the visual field that has a cat in it, it would help to already know what a cat was. But on the other hand, how can we develop the abstract generalization of “cat” when we don’t yet know anything about fur, paws, teeth, etc?
As they put it:
How can high-level abstract representations develop prior to the lower-level representations that they build upon?
In other words, this is a case where “it takes the whole network to raise a model”—the entire predictive learning problem must be solved with a complete, interacting network, and cannot be solved piece-wise.
The next two diagrams are from their paper. The first shows various levels divided into superficial (blue) and deep (pink). If you look at ‘V1’ for example, you see that its superficial layer feeds into the superficial layer of V2. The main thing to notice is that deep layers don’t receive input from superficial layers of levels that are lower than the level they are part of. Also, the superficial layer of TEO, a level at the top of the hierarchy, sends feedback to a level near the bottom (the deep layers of V2), and V2’s superficial layers send a signal directly to a level at the top (MT). TEO’s deep layer sends a feedback signal to V2’s superficial layer as well. The original caption for this figure states: “Special patterns of connectivity from TEO to V3 and V2, involving crossed super-to-deep and deep-to-super pathways, provide top-down support for predictions based on high-level object representations (particularly important for novel test items).”
The second diagram shows only the deep section of each level (purple) next to a corresponding section of the pulvinar that it sends projections to. As you can see, the pulvinar that receives V1 information sends out error information via projections to many other areas. The original caption for the figure states: “Most areas send deep-layer prediction inputs into the main V1p prediction layer, and receive reciprocal error signals therefrom. The strongest constraint we found was that pulvinar outputs (colored green) must generally project only to higher areas, not to lower areas, with the exceptions of MTp –> V3 and LIPp –> V2”
The exciting aspect of this model is that without a teacher, it learns to recognize objects, irrespective of movement.
The authors are now looking at modeling 3D objects and also plan to combine auditory information with the visual model they have constructed so far.
Deep Predictive Learning: A Comprehensive Model of Three Visual Streams
Randall C. O’Reilly, Dean R. Wyatte, and John Rohrlich Department of Psychology and Neuroscience University of Colorado Boulder, 345 UCB, Boulder, CO 80309
CCNBook — https://grey.colorado.edu/CompCogNeuro/index.php/CCNBook/Main
eCortex is a company where Randall O’Reilly works. Its research has been commissioned by some of the biggest government and commercial entities with an interest in applied brain-based artificial intelligence.
To be more accurate, the deep layer represents 100 milliseconds of information, not just 75, because its processing overlaps with the 25 milliseconds of the ‘ground truth’ signal that is sent to the pulvinar.