Making Neural Nets more decipherable and closer to Computers

In an article titled “Neural Turing Machines”, three researchers from Google DeepMind (Alex Graves, Greg Wayne, and Ivo Danihelka) describe a neural net that has a new feature: a memory bank. The system is similar in this respect to a Turing machine, which was originally proposed by Alan Turing in 1936. His hypothetical machine had a read/write head that wrote on squares on a tape, and could move to other squares and read from them as well. So it had a memory. In theory, it could compute anything that modern computers can compute, given enough time.

One advantage of making a Neural Net that is also a Turing machine is that it can be trained with gradient descent algorithms.   That means it doesn’t just execute algorithms, it learns algorithms (though, if you want to be fanatical, you might note that since a Turing machine can simulate any recipe that a computer can execute, it could simulate a neural net that learns as well).

The authors say this:

Computer programs make use of three fundamental mechanisms: elementary operations (e.g., arithmetic operations), logical flow control (branching), and external memory, which can be written to and read from in the course of computation. Despite its wide-ranging success in modelling complicated data, modern machine learning has largely neglected the use of logical flow control and external memory.

Recurrent neural networks (RNNs) …are Turing-Complete and therefore have the capacity to simulate arbitrary procedures, if properly wired. Yet what is possible in principle is not always what is simple in practice. We therefore enrich the capabilities of standard recurrent networks to simplify the solution of algorithmic tasks. This enrichment is primarily via a large, addressable memory, so, by analogy to Turing’s enrichment of finite-state machines by an infinite memory tape, we dub our device a “Neural Turing Machine” (NTM). Unlike a Turing machine, an NTM is a differentiable computer that can be trained by gradient descent, yielding a practical mechanism for learning programs.

They add that in humans, the closest analog to a Turing Machine is ‘working memory’ where information can be stored and rules applied to that information.

…In computational terms, these rules are simple programs, and the stored information constitutes the arguments of these programs.

A Neural Turing Machine is designed

to solve tasks that require the application of approximate rules to “rapidly-created variables.” Rapidly-created variables are data that are quickly bound to memory slots, in the same way that the number 3 and the number 4 are put inside registers in a conventional computer and added to make 7.

… In [human] language, variable-binding is ubiquitous; for example, when one produces or interprets a sentence of the form, “Mary spoke to John,” one has assigned “Mary” the role of subject, “John” the role of object, and “spoke to” the role of the transitive verb.

A Neural Turing Machine (NTM) architecture contains two components: a neural network controller and a memory bank.

Like most neural networks, the controller interacts with the external world via input and output vectors. Unlike a standard network, it also interacts with a memory matrix…. By analogy to the Turing machine we refer to the network outputs that parametrize these operations as “heads.”
Crucially, every component of the architecture is differentiable, making it straightforward to train with gradient descent. We achieved this by defining ‘blurry’ read and write operations that interact to a greater or lesser degree with all the elements in memory (rather than addressing a single element, as in a normal Turing machine or digital computer).

In a regular computer, a number is retrieved by fetching it at a given address.

Their net has two differences in retrieval from a standard computer.   First of all, they retrieve  an entire vector of numbers from a particular address.   Think of a rectangular matrix, where each row number is an address, and the row itself is the vector that is retrieved.

Secondly, instead of retrieving at just one address, there is a vector of weights that controls the retrieval at multiple addresses.    The weights in that vector add up to ‘1’.   Think of a memory matrix consisting of 5 vectors.   There will be 5 corresponding weights.

If the weights were:

[0, 0, 1, 0, 0]

then only one vector will be retrieved, the vector at the third row of the matrix. This is similar to ordinary location-based addressing in computers or Turing machines. You can also shift that ‘1’ each cycle, so that each read retrieves the vector adjacent to the one retrieved before.
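
Here is a minimal sketch of that shifting idea in Python. The function name `shift_weights` and the wrap-around at the end of memory are assumptions made for illustration, and the sketch moves a sharp ‘1’ rather than a blurry set of weights.

```python
def shift_weights(weights, step=1):
    """Rotate a weighting by `step` positions, wrapping around at the ends."""
    n = len(weights)
    return [weights[(i - step) % n] for i in range(n)]

weights = [0, 0, 1, 0, 0]          # focus on the third row
weights = shift_weights(weights)   # focus moves to the fourth row
print(weights)                     # [0, 0, 0, 1, 0]
```

Calling `shift_weights` once per cycle walks the focus through memory one row at a time.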

Now think of the following vector of weights:

[0, 0.3, 0.7, 0, 0]

In this case two vectors are retrieved (one from the second row, and one from the third). The first has all its elements multiplied by 0.3, the second has all its elements multiplied by 0.7, and then the two are added. This gives one resultant vector. They say this is a type of ‘blurry’ retrieval.
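
The weighted retrieval can be sketched in a few lines of Python. The memory contents and the name `read_memory` are made up for illustration; only the weights 0.3 and 0.7 on rows two and three come from the example above.

```python
def read_memory(memory, weights):
    """Return the weighted sum of the memory rows (a 'blurry' read)."""
    width = len(memory[0])
    return [sum(w * row[j] for w, row in zip(weights, memory))
            for j in range(width)]

# Hypothetical 5 x 3 memory: five addresses, each holding a vector of 3 numbers.
memory = [
    [1.0, 1.0, 1.0],
    [10.0, 20.0, 30.0],
    [20.0, 40.0, 60.0],
    [5.0, 5.0, 5.0],
    [9.0, 9.0, 9.0],
]

sharp = read_memory(memory, [0, 0, 1, 0, 0])        # exactly the third row
blurry = read_memory(memory, [0, 0.3, 0.7, 0, 0])   # roughly [17, 34, 51]
print(sharp, blurry)
```

With a one-hot weighting the read behaves like an ordinary address lookup; with spread-out weights it returns a blend of rows.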

They use the same idea when writing to memory – a weight vector controls how strongly each memory location is written to.

This vector-multiplication method of retrieval allows the entire mechanism to be trained by gradient descent. It can also be thought of as an ‘attentional mechanism’, where the focus is on the vectors with relatively high corresponding weights.

Some other nets do a probabilistic type of addressing, where there is a probability distribution over all the vectors, and at each cycle the net uses the most probable one (perhaps with a random component). But since Neural Turing Machines learn by gradient descent, the designers instead had to use the distribution as weights, so that a weighted sum of memory vectors is retrieved. This was not a bug, but a feature!

They say:

The degree of blurriness is determined by an attentional “focus” mechanism that constrains each read and write operation to interact with a small portion of the memory, while ignoring the rest… Each weighting, one per read or write head, defines the degree to which the head reads or writes at each location. A head can thereby attend sharply to the memory at a single location or weakly to the memory at many locations.

Writing to memory is done in two steps:

we decompose each write into two parts: an erase followed by an add.
Given a weighting w_t emitted by a write head at time t, along with an erase vector e_t whose M elements all lie in the range (0,1), the memory vectors M_{t-1}(i) from the previous time-step are modified as follows:

$$\tilde{M}_t(i) = M_{t-1}(i) \left[ \mathbf{1} - w_t(i)\, e_t \right]$$

where 1 is a row-vector of all 1’s, and the multiplication against the memory location acts point-wise. Therefore, the elements of a memory location are reset to zero only if both the weighting at the location and the erase element are one; if either the weighting or the erase is zero, the memory is left unchanged.
Each write head also produces a length M add vector a_t, which is added to the memory after the erase step has been performed:

$$M_t(i) = \tilde{M}_t(i) + w_t(i)\, a_t$$

The combined erase and add operations of all the write heads produces the final content of the memory at time t. Since both erase and add are differentiable, the composite write operation is differentiable too.
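
The erase-then-add write can be sketched as follows. The function name `write_memory` and the toy numbers are assumptions for illustration, and the erase values are pushed to the extremes 0 and 1 so the effect is easy to see.

```python
def write_memory(memory, weights, erase, add):
    """Erase-then-add write: each row i is scaled by (1 - w_i * e) and then
    has w_i * a added to it, element by element."""
    new_memory = []
    for w, row in zip(weights, memory):
        erased = [m * (1 - w * e) for m, e in zip(row, erase)]
        new_memory.append([m + w * a for m, a in zip(erased, add)])
    return new_memory

memory = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # hypothetical 3 x 2 memory
weights = [0.0, 1.0, 0.0]                        # write head focused on row 2
erase = [1.0, 0.0]                               # wipe only the first element
add = [7.0, 0.5]

print(write_memory(memory, weights, erase, add))
# [[1.0, 2.0], [7.0, 4.5], [5.0, 6.0]]
```

Rows with zero weight are untouched, while the focused row has its first element replaced and its second element nudged by the add vector.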

The network that outputs these vectors and that reads and writes to memory, as well as taking inputs and producing outputs, can be a recurrent neural network, or a plain feedforward network.   In either case, the vector used to retrieve from memory locations is then fed back, along with the inputs, into the net.

The authors trained their net on various problems, such as copying a sequence of numbers, or retrieving the next number in an arbitrary sequence given the one before it. It came up with algorithms such as this one, for copying sequences of numbers: (in the following, a ‘head’ can be either a read-head or a ‘write-head’ and has a vector of weights associated with it to weight the various memory vectors for the process of combination and retrieval, or for writing.)

initialise: move head to start location
while input delimiter not seen do
    receive input vector
    write input to head location
    increment head location by 1
end while
return head to start location
while true do
    read output vector from head location
    emit output
    increment head location by 1
end while

This is essentially how a human programmer would perform the same task in a low-level programming language. In terms of data structures, we could say that NTM has learned how to create and iterate through arrays. Note that the algorithm combines both content-based addressing (to jump to start of the sequence) and location-based addressing (to move along the sequence).
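
For comparison, here is roughly what that copy algorithm looks like when written by hand in Python. This is only a sketch: it uses ordinary ‘sharp’ addressing on a plain list, not the learned blurry addressing of the NTM, and the names `copy_sequence`, `delimiter` and `memory_size` are hypothetical.

```python
def copy_sequence(inputs, delimiter=None, memory_size=16):
    """Store inputs in consecutive memory slots, then read them back in order."""
    memory = [None] * memory_size
    head = 0                           # initialise: move head to start location
    count = 0
    for item in inputs:
        if item is delimiter:          # stop writing at the delimiter
            break
        memory[head] = item            # write input to head location
        head += 1                      # increment head location by 1
        count += 1
    head = 0                           # return head to start location
    outputs = []
    for _ in range(count):             # the quoted loop runs forever; we stop here
        outputs.append(memory[head])   # read output vector from head location
        head += 1
    return outputs

print(copy_sequence([[1, 0], [0, 1], [1, 1], None]))
# [[1, 0], [0, 1], [1, 1]]
```

The sketch assumes the sequence fits within `memory_size` slots.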

The way the NTM solves problems is easier to understand than trying to decipher a standard recurrent neural net, because you can look at how memory is being addressed, and what is being retrieved and written to memory at any point.
There is more to the NTM than I have explained above as you can see from the following diagram from their paper:


Take-home lesson: on the tasks the authors tested, the Neural Turing Machine outperformed a standard LSTM (a neural net where each unit has a memory cell, plus trainable gates to decide what to forget and what to remember), and it generalized better as well, for example to sequences longer than those seen in training. It is also easier to understand what the net is doing, especially if you use a ‘feedforward net’ as the ‘controller’. The net doesn’t just passively compute outputs, it decides what to write to memory, and what to retrieve from memory.

Neural Turing Machines by Alex Graves, Greg Wayne and Ivo Danihelka – Google DeepMind, London, UK

Making Recurrent neural net weights decipherable – new ideas.

One problem with neural nets is that after training, their inner workings are hard to interpret.
The problem is even worse with recurrent neural networks, where the hidden layer feeds its outputs back to itself, alongside the inputs, at the next time step.

Before I talk about how the problem has been tackled, I should mention an improvement to standard recurrent nets, which its authors (Sepp Hochreiter and Jürgen Schmidhuber) called LSTM (Long Short-Term Memory). The inventors of this net realized that backpropagation isn’t limited to training a relation between two patterns; it can also be used to train gates that control the learning by the other gates. One such gate is a ‘forget gate’. It uses a ‘sigmoid function’ on the weighted sum of its inputs. Sigmoid functions are shaped like a slanted letter ‘S’, and the bottom and top of the ‘S’ are at zero and 1 respectively. This means that if you multiply a signal by the output of a sigmoid function, at one extreme you could be multiplying by zero, which means that the product is zero too, so no signal gets through the gate. At the other extreme, you would be multiplying by 1, so that the entire signal gets through. Since sigmoid gates are differentiable, backpropagation can be used on them. In an LSTM, you have a cell-state that holds a memory value, as well as having one or more outputs. In addition to the standard training, you also train a gate to decide how much of the past ‘memory’ to forget on each time-step as a sequence of inputs is presented to the net. A good explanation of LSTMs is at:, but the point to remember is that you can train gates to control the learning process of other gates.
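
To make the gating idea concrete, here is a minimal sketch of a sigmoid gate scaling a stored value. The name `forget_step` is hypothetical, and the gate input is passed in as a raw number; in a real LSTM it would be a learned weighted sum, and new information would also be added to the cell state.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forget_step(cell_state, gate_input):
    """Scale the old memory by a gate value between 0 and 1."""
    return sigmoid(gate_input) * cell_state

print(forget_step(5.0, -20.0))  # gate nearly closed: almost nothing is kept
print(forget_step(5.0, 20.0))   # gate nearly open: almost all of 5.0 is kept
print(forget_step(5.0, 0.0))    # gate half open: 2.5
```

Because the sigmoid is smooth, a small change in the gate input gives a small change in how much memory is kept, which is what lets backpropagation adjust the gate.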

So back to making sense of the weights of recurrent nets. One approach is the IndRNN (Independently Recurrent Neural Network). If you will recall, a recurrent net with 5 hidden nodes would not only feed forward 5 signals into each neuron of its output layer, but would also send 5 branches with the signals from the 5 hidden nodes as 5 extra ‘inputs’ to join the normal inputs in the next time step. If you had 8 inputs, then in total you would have 13 signals feeding into every hidden node. Once a net like this is trained, the actual intuitive meaning of the weights is hard to unravel, so the authors asked: why not just feed each hidden node into itself, thus keeping the hidden nodes independent of each other? Each node still gets all the normal signals from the inputs, but in the above example, instead of getting 5 signals from the hidden layer’s previous time step as well, it gets just one extra signal – that of itself on the previous time step. This may seem to reduce the power of the net since there are fewer connections, but the authors report that it actually makes the net more powerful. One plus is that with this connectivity, the net can be trained with many layers stacked at each time step. Another plus is that the neurons don’t have to use ‘S’-shaped functions; they can work with non-saturated activation functions such as ReLU (rectified linear unit – which is a diagonal line when the weighted sum of neural inputs is zero and above, and otherwise is a horizontal line with value zero).


It is easier to understand what a net like this is doing than a traditional recurrent net.
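
A single time step of such a net might be sketched like this. The sizes, weights and the name `indrnn_step` are invented for illustration; the point is only that the recurrent weights form a vector (one per node) instead of a full matrix.

```python
def relu(z):
    return z if z > 0 else 0.0

def indrnn_step(inputs, hidden, input_weights, recurrent_weights, biases):
    """One time step: node n sees all inputs but only its own previous value."""
    new_hidden = []
    for n in range(len(hidden)):
        total = sum(w * x for w, x in zip(input_weights[n], inputs))
        total += recurrent_weights[n] * hidden[n]   # one recurrent weight per node
        new_hidden.append(relu(total + biases[n]))
    return new_hidden

# Hypothetical toy sizes: 2 inputs, 3 hidden nodes.
input_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
recurrent_weights = [0.5, 1.0, 2.0]   # a vector, not a 3 x 3 matrix
biases = [0.0, 0.0, 0.0]

hidden = [0.0, 0.0, 0.0]
for inputs in [[1.0, 2.0], [0.0, 0.0]]:
    hidden = indrnn_step(inputs, hidden, input_weights, recurrent_weights, biases)
print(hidden)  # [0.5, 2.0, 0.0]
```

Each recurrent weight can be read on its own: 0.5 makes a node’s memory fade, 1.0 holds it steady, and anything above 1 would make it grow.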

Another ingenious idea came from a paper titled Opening the Black Box: Low-dimensional dynamics in high-dimensional recurrent neural networks, by David Sussillo of Stanford and Omri Barak of the Technion.

A recurrent network is a non-linear dynamic system, in that at any time step, the output of a computation is used for the inputs of the next time-step, where the same computation is made. Once the weights are learned, you can write the computation of the net as one large equation. In the equation below, the J matrix holds the weights from the context (hidden units feeding back), the B matrix holds the weights for the regular inputs, and h is a function such as the hyperbolic tangent (tanh). The symbol x is the union of u and r, where u are the signals from the input neurons.


The systems described by these equations can have attractors, such as fixed points. You can think of fixed points as being at the bottom of a basin in a landscape. If you roll a marble anywhere into the valley, it will roll to the bottom. In the space of patterns, all patterns that are in the basin will evolve over time to the pattern at the bottom. Attractors do not have to be fixed points: they can be lines, or they can be a repeating sequence of points (the sequence repeats as time goes by), or they can never repeat but still be confined in a finite space – those trajectories in pattern space are called ‘strange attractors’. A fixed point can be a point where all neighboring patterns eventually end up, or it can be a repeller, so that all patterns in its neighborhood evolve to go away from it. Another interesting type of fixed point is a saddle. Here patterns in some directions evolve toward the point, but patterns in other directions evolve to go away from it. Think of a saddle of a horse. You can fall off sideways (that would be the ‘repeller’), but if you were jolted forward and upward in the saddle, you would slide back to the center (the attractor).


So Sussillo and Barak looked for fixed points in recurrent networks. They also looked for ‘slow points’, that is, points that attract, but eventually drift. I should mention here that just like in a basin, the area around a fixed point is approximately linear (if you are looking at a small area). As patterns approach an attractor, usually they start off quickly, but the progress slows down the closer they get to it.

The authors write:

Finding stable fixed points is often as easy as running the system dynamics until it converges (ignoring limit cycles and strange attractors). Finding repellers is similarly done by running the dynamics backwards. Neither of these methods, however, will find saddles. The technique we introduce allows these saddle points to be found, along with both attractors and repellers. As we will demonstrate, saddle points that have mostly stable directions, with only a handful of unstable directions, appear to be of high significance when studying how RNNs accomplish their tasks.
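
The first two methods in that quote are easy to try on a toy system. The one-dimensional system below is a made-up example, not one from the paper: it has attractors at -1 and +1 and a repeller at 0.

```python
def F(x):
    """Hypothetical 1-D system: stable fixed points at -1 and +1, repeller at 0."""
    return x - x**3

def run(x, direction=1.0, dt=0.01, steps=5000):
    """Euler-integrate dx/dt = direction * F(x); direction=-1 runs time backwards."""
    for _ in range(steps):
        x += direction * dt * F(x)
    return x

print(round(run(0.5), 4))                  # forwards: settles on the attractor at 1.0
print(round(run(0.5, direction=-1.0), 4))  # backwards: settles on the repeller at 0.0
```

A saddle cannot be found this way, since it attracts in some directions and repels in others whichever way time runs.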

Why is finding saddles valuable?

A saddle point with one unstable mode can funnel a large volume of phase space through its many stable modes, and then send them to two different attractors depending on which direction of the unstable mode is taken.
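
A small two-dimensional sketch shows this funnelling. The system is invented for illustration: it has a saddle at (0, 0) with one stable direction (y) and one unstable direction (x), and attractors at (-1, 0) and (1, 0).

```python
def step(x, y, dt=0.01):
    """One Euler step of a hypothetical 2-D system with a saddle at (0, 0)."""
    return x + dt * (x - x**3), y + dt * (-y)

def settle(x, y, steps=5000):
    """Run the dynamics for a while and report where the state ends up."""
    for _ in range(steps):
        x, y = step(x, y)
    return round(x, 3), round(y, 3)

# Both starts are pulled in along y (the stable mode) and then pushed out
# along x (the unstable mode), ending at different attractors.
print(settle(0.001, 2.0))    # ends near (1.0, 0.0)
print(settle(-0.001, 2.0))   # ends near (-1.0, 0.0)
```

Two starting points that differ only slightly on either side of the saddle end up at opposite attractors, which is the decision-like behavior the quote describes.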

Consider a system of first-order differential equations

$$\dot{x} = F(x)$$

where x is an N-dimensional state vector and F is a vector function that defines the update rules (equations of motion) of the system. We wish to find values around which the system is approximately linear. Using a Taylor series expansion, we expand F(x) around a candidate point x* in phase space, with δx a small displacement from it:

$$F(x^* + \delta x) = F(x^*) + F'(x^*)\, \delta x + \tfrac{1}{2}\, \delta x\, F''(x^*)\, \delta x + \cdots$$

(A Taylor expansion uses the idea that if you know the value of a function at a point x, you can approximate the value of the function at a nearby point (x + delta-x), using first-order derivatives, second-order derivatives, up to n’th-order derivatives.)
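
A quick numerical check of that idea, using the sine function as a stand-in (my choice, not the paper’s), shows how each extra derivative term shrinks the error:

```python
import math

x, dx = 1.0, 0.01
exact = math.sin(x + dx)
first_order = math.sin(x) + math.cos(x) * dx              # value + slope * step
second_order = first_order - 0.5 * math.sin(x) * dx**2    # add the curvature term

print(abs(exact - first_order))    # small error
print(abs(exact - second_order))   # much smaller error
```

For a small enough step, the first-order (linear) term already captures most of the change, which is why the neighborhood of a point can be treated as approximately linear.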

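The authors say that because they are interested in the linear regime, they want the first-derivative term of the right-hand side to dominate the other terms, so that, for a candidate point x* and a small displacement δx from it,

$$F(x^* + \delta x) \approx F'(x^*)\, \delta x$$
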
They say that this observation “motivated us to look for regions where the norm of the dynamics, |F(x)|, is either zero or small. To this end, we define an auxiliary scalar function”:

$$q(x) = \frac{1}{2} |F(x)|^2$$

In the caption of the equation, they explain that there is an intuitive correspondence to speed in the real physical world: |F(x)| is how fast the state is moving, so q behaves like a kinetic energy.
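
As I read the paper, the auxiliary function is q(x) = ½|F(x)|², and fixed or slow points are found by minimizing it. Here is a minimal sketch of that idea on a made-up two-dimensional system with a saddle at (0, 0). Plain gradient descent with a hand-derived gradient is my simplification, not necessarily the optimizer the authors used.

```python
def F(x, y):
    """Hypothetical 2-D system with a saddle at (0, 0)."""
    return x - x**3, -y

def q(x, y):
    """Half the squared speed of the system at (x, y)."""
    fx, fy = F(x, y)
    return 0.5 * (fx**2 + fy**2)

def grad_q(x, y):
    """Gradient of q, worked out by hand for this particular F."""
    fx, fy = F(x, y)
    return fx * (1 - 3 * x**2), y

x, y = 0.3, 0.5
for _ in range(2000):
    gx, gy = grad_q(x, y)
    x, y = x - 0.1 * gx, y - 0.1 * gy

# Descending on q reaches the saddle, which simulating the dynamics
# forwards or backwards would never settle on.
print(round(x, 6), round(y, 6), q(x, y) < 1e-12)
```

Minimizing q does not care whether a point is stable, unstable or a saddle; it only looks for places where the system is barely moving.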

A picture that shows a saddle with attractors on either side follows:


The authors trained recurrent nets on several problems, and found saddles between attractors, which allowed them to understand how the net was solving problems and representing data. One of the more difficult problems they tried was to train a recurrent net to produce a sine wave given an input that represented the desired frequency. The amplitude of the input signal encoded the frequency (the higher the amplitude, the higher the frequency they wanted the output to oscillate at), and they trained the output neuron to oscillate at a frequency proportional to that input. When they analyzed the dynamics, they found that, even though fixed points were not reached,

For the sine wave generator the oscillations could be explained by the slightly unstable oscillatory linear dynamics around each input-dependent saddle point.

I’m not clear on what the above means, but it is known that you can have limit cycles around certain types of fixed points (unstable ones). In the sine wave example, the locations of attractors and saddle points differ depending on what input is presented to the network. In the other problems they trained the net on, the saddle points were at the same place no matter what inputs were presented, because the analysis was done in the absence of input – maybe because the input was transient (applied for a short time), whereas in the sine wave task it was always there. So in the sine wave example, if you change the input, you change the whole attractor landscape.

They also say that studying slow points, as opposed to just fixed points, was valuable, since

funneling network dynamics can be achieved by a slow point, and not a fixed point

(as shown in the next figure):


A mathematician who I’ve corresponded with told me his opinion of attractors.  He wrote:

I think that:
• a memory is an activated attractor.
• when a person gets distracted, the current attractor is destroyed and gets replaced with another.
• the thought process is the process of one attractor triggering another, then another.
• memories are plastic and can be altered through suggestion, hypnosis, etc.  Eye witness accounts can be easily changed, simply by asking the right sequence of questions.
• some memories, once thought to be long forgotten, can be resurrected by odors, or a musical song.

One can speculate that emotions are a type of attractor. When you are depressed, the types of thoughts you have are sad ones, and when you are angry at a friend, you dredge up the memories of the annoying things they did in the past.

In the next post, I’ll discuss a different approach to understanding a recurrent network. It’s called a “Neural Turing Machine”. I’ll explain a bit about it here.

Kurt Gödel had shown that any consistent set of axioms rich enough for arithmetic leaves some true statements unprovable.

Before Gödel came along, there had been a half-century of attempts to find a set of axioms sufficient for all mathematics, but that ended when he proved the “incompleteness theorem”.

In hindsight, the basic idea at the heart of the incompleteness theorem is rather simple. Gödel essentially constructed a formula that claims that it is unprovable in a given formal system. If it were provable, it would be false. Thus there will always be at least one true but unprovable statement. That is, for any computably enumerable set of axioms for arithmetic (that is, a set that can in principle be printed out by an idealized computer with unlimited resources), there is a formula that is true of arithmetic, but which is not provable in that system.

In a paper published in 1936 Alan Turing reformulated Kurt Gödel’s 1931 results on the limits of proof and computation, replacing Gödel’s universal arithmetic-based formal language with hypothetical devices that became known as Turing machines. These devices wrote on a tape and then moved the tape, but they could compute anything (in theory) that any modern computer can compute. They needed a list of rules to know what to write on the tape in different conditions, and when and where to move it.
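
To show how little machinery such a device needs, here is a toy simulator. The rule format and the bit-flipping rule list are hypothetical examples, and the sketch moves the head rather than the tape, which amounts to the same thing.

```python
def run_turing_machine(tape, rules, state="start", head=0, blank="_"):
    """Apply (state, symbol) -> (write, move, next_state) rules until 'halt'."""
    cells = dict(enumerate(tape))
    while state != "halt":
        symbol = cells.get(head, blank)
        write, move, state = rules[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    return "".join(cells[i] for i in sorted(cells)).strip(blank)

# Hypothetical rule list: flip every bit, then halt at the first blank square.
flip_rules = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}

print(run_turing_machine("1011", flip_rules))  # 0100
```

Everything the machine ‘knows’ is in the rule table; the tape is just memory.
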
So Alex Graves, Greg Wayne and Ivo Danihelka of Google DeepMind in London came up with the idea to make a recurrent neural net with a separated memory section that could be looked at as a Turing machine with its tape. You can see their paper here: I’ve corresponded with one author, and hopefully can explain their project in my next post.

Opening the Black Box: Low-dimensional dynamics in high-dimensional recurrent neural networks by David Sussillo and Omri Barak
Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN – by Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, Yanbo Gao

When your character flaw is due to brain chemicals

Blue Dreams is a new book by Lauren Slater. It describes the drugs that have been invented to treat mental illness. Lauren is a science journalist who has needed some of those drugs. It’s a very well-written book, with both fascinating history and science in it. But let’s concentrate here on the science.
The question we will ask is, can character flaws be created by a simple neurotransmitter imbalance?
I think the answer is yes, because of this paragraph in her book (warning: she includes some explicit terms):

Because Prozac dampens the sex drive, psychiatrists often use it to treat compulsive masturbators and others with heightened libido or sexual-addiction disorders that leave their lives in shreds. Martin Kafka, a psychiatrist at McLean Hospital in Belmont, Massachusetts,…, has an entire practice comprising men who are addicted to sex. These are largely married men who nevertheless seek out prostitutes and pornography, not once a day, not twice day, but twenty or thirty times in a twenty-four-hour period. Men haunted and ravaged by their own internal fires, men eaten alive by uncontrollable desire, men whose brains are likely damp with dopamine coursing down dendrites and being sucked up by axons in a never-ending obsessive circuit. Kafka is not the only psycho-pharmacologist who uses Prozac and its chemical cousins in this manner. The literature is rife with cases […], all trained and tamed by serotonin-boosting drugs. Kafka has seen these drugs turn men around, has seen his patients go from the far fringes of fantasy, pornography, and prostitution to surprisingly conventional existences, picket fence and all.
Sex addicts for whom Prozac allows a normal life, most of whom are men, generally tend to be grateful…

So here we have a drug giving people the willpower to lead decent lives. I think this is remarkable.

Conversely, the author says this about Zyprexa, a drug that she takes because the alternative is worse:

…the problem with Zyprexa was that it so intensified my appetite that it was beyond satiation, so that the mere mention of food caused my mouth to water….As the Zyprexa toyed first with my metabolism and then with my body, my weight went up, up, up, with the end result that I am now an overweight diabetic. High blood sugar is destroying my eyesight, so that without glasses everything looks fuzzy….My high sugar has also caused my kidneys to malfunction…

To sum up: we have a drug (Zyprexa) that reduces willpower, or drowns it out with a strong urge to eat. You could say it causes a character flaw.
Prozac increases the levels of serotonin at synapses. Zyprexa blocks or lessens the effects of dopamine and serotonin. It is not known if these are the only effects, or why those effects have curative properties.

We like to believe we are in charge – but if a chemical can make us a slave to our impulses – or conversely free us from impulses, what does this say for human responsibility, guilt, and shame?

Serotonin, a neurotransmitter affected by both Prozac and Zyprexa, also occurs in wasps, earthworms, and lobsters. The author of “12 Rules For Life”, Jordan Peterson, wrote about lobsters, noting that they have a social hierarchy, and that if you are a lucky, respected lobster, you have high serotonin.

Lauren Slater points out that serotonin interacts with many other neurotransmitter systems, and so why drugs such as Zyprexa or Prozac work at all is not known.

I’m not trying here to make excuses for every out-of-control rascal (“I was made that way, I can’t help it”), but it is interesting that one of our presidents, JFK, when asked why he pursued so many ‘affairs’, replied “I can’t help it”. Maybe that wasn’t just an excuse!

(note: Lauren Slater says there is no proof that these drugs address a simple ‘neurotransmitter imbalance’.  Nonetheless, whatever is going on is being alleviated by a single chemical taken in pill form.)

Unraveling The Mystery of How The Brain Makes The Mind – Michael Gazzaniga’s new book

Michael Gazzaniga, who directs the SAGE Center for the Study of the Mind at UC Santa Barbara, has come out with a book on how the brain gives rise to the mind. He believes that consciousness is not tied to a specific neural network. Various functions that take place in the brain each have intrinsic consciousness. So what is his proof? One line of evidence is what happens when the cable connecting the left and right hemispheres is cut:

…the left hemisphere keeps on talking and thinking as if nothing had happened even though it no longer has access to half of the human cortex. More important, disconnecting the two half brains instantly creates a second, also independent conscious system. The right brain now purrs along carefree from the left, with its own capacities, desires, goals, insights and feelings. One network, split into two, becomes two conscious systems.

Patients with a lesion in a particular part of the right hemisphere will behave as if part or all of the left side of their world…does not exist!

This could include not eating off the left side of their plate, not shaving…on the left side of their face…not reading the left pages of a book, etc.

It seems that when you lose a capacity, you are not even cognizant of what you’ve lost. You don’t know what you don’t know.

In a section titled “What is it like to be a Right Hemisphere”, Gazzaniga says that you would not notice the loss of the left hemisphere. You would have trouble communicating, because you lost your speech center. You would also lose your ability to make inferences.

Though you would know that others had intentions, beliefs and desires, and you could attempt to guess what they might be, you would not be able to infer cause and effect. You would not be able to infer why someone is angry or believes as they do.

Not only do you not infer that your neighbor is angry because you left the gate open and her dog got out, you don’t infer that the dog got out because you left the gate open.

On the positive side, without a left hemisphere:

You won’t be a hypocrite and rationalize your actions…You would have no understanding of metaphors…

This is interesting also if you look at my prior post in this blog on the causal diagrams of Judea Pearl. He says that the causal diagrams he draws must be similar to how we represent them internally, and yet, Gazzaniga is saying that only one hemisphere has the ability to think about causes.

In a prior post, I mentioned that: “V. S. Ramachandran’s studies of anosognosia reveal a tendency for the left hemisphere to deny discrepancies that do not fit its already generated schema of things. The right hemisphere, by contrast, is actively watching for discrepancies, more like a devil’s advocate. These approaches are both needed, but pull in opposite directions.”

So I suppose if you were a right hemisphere, your ways of thinking would be very different than if you were a left hemisphere.

In fact, Gazzaniga gives fascinating illustrations of such differences. Suppose you show a film of a ball coming toward another ball, not quite touching, but the second ball acting as if it was hit and taking off. The left hemisphere of a split-brain patient will fall for the illusion even if there is a delay in the second ball moving, or if the distance between the balls is increased. The right hemisphere eventually sees that it is an illusion. On the other hand, the left hemisphere will solve problems with logical inference that the right hemisphere can’t. It seems that the right hemisphere can PERCEIVE causality, but the left hemisphere can INFER causality.

Rebecca Saxe at MIT found that the right half brain has special ‘hardware’ to determine the intentions of other people. (This ability is also known as ‘theory of mind’.) Based on this finding, a former student of Gazzaniga’s, Michael Miller, and a philosopher, Walter Sinnott-Armstrong, had the idea of looking at split-brain patients to see how each hemisphere evaluated moral responsibility. They presented scenarios such as this one:

If a secretary wants to bump off her boss and intends to add poison to his coffee, but unknown to her, it actually is sugar, and he drinks it and is fine, was that permissible?

The left hemisphere judges the secretary as blameless, despite the malicious intent, since no harm was done.   The right hemisphere would not agree – it would take ‘intent’ into account.  But the judgement of the right hemisphere is not available to the left hemisphere, because the communication cable between the hemispheres is cut.  There may be some type of communication though – the right hemisphere has a ‘bad feeling’ about the situation, and that feeling is available via older brain areas to both hemispheres.  So the left hemisphere may feel impelled to explain its decision.

Gazzaniga writes that even though your experience seems a coherent, flawlessly edited film, it is instead

a stream of single vignettes that surface like bubbles in a boiling pot of water, linked together by their occurrence of time.

This raises a question:

Do the bubbles burst willy nilly, or are they the product of a dynamic control system?  Is there a control layer giving some bubbles the nod and quashing others?

He gives several options:

  1. Competition: if you bite into bitter chocolate, no module that processes sweet sensations is activated. Bitter information is being processed. If you eat milk chocolate instead, then the ‘sweet module’ is up and running as well, and outcompetes bitterness by a landslide.
  2. Top-down expectation: you are searching for the one person in the crowd with a red flower in her hair.
  3. Arbitrary rules: for instance, if you have been told that low-fat diets are the road to health, then bubbles will come up as you shop for groceries, guiding you to stay clear of various products. If you then read that ‘fat’ is actually good for you, you will get different bubbles.

At one point he also suggests that the ‘bubbles’ are linked by the feelings they engender as well. A possibility that occurs to me is that the ‘interpreter’ in our brain – an area whose function is to make a story out of our thoughts and make sense of our own actions – might also link these experiences.

There is much more to the book.
One analogy he makes that I have not heard before concerns the hierarchical sets of layers in the cortex, which he compares to the screen of a computer. When you interact with a computer, as far as your experience goes, you don’t interact with circuits and registers and bits; you interact with icons and pictures and words on a screen, using a keyboard and a mouse. The details are abstracted for you. The same goes for the hierarchies of layers in the brain – each layer just passes an abstracted version to the next.

After finishing the book I’m still left with a mystery.   Suppose a ‘bubble’ outcompetes all others, and is the focus of my mind.   And each bubble has intrinsic consciousness that goes with it.   How is that subjective consciousness felt and experienced?  Where is the qualia of a bright red sunset? Where is our feeling of 3-D space? What does it mean to feel we are exerting a force, such as a force of willpower not to have dessert? And where does the inner voice come from that says something is wrong with a story you’ve just heard, but you can’t pinpoint yet what it is? Is it a module trying to get into consciousness?

The Consciousness Instinct – Unraveling The Mystery of How The Brain Makes The Mind – Michael S. Gazzaniga (2018)

Judea Pearl’s causal revolution – and implications for A.I.

Judea Pearl recently wrote a book for a popular audience about his life work, called “The Book of Why”. Pearl is the inventor of “Bayesian Networks”, which are graphs whose links are probabilities. Such a graph might have some nodes that represented symptoms of a disease, and other nodes that represented various diseases.  The links between nodes, as well as the decision of which nodes link to which, can help diagnose which disease a person has, given his symptoms. The probabilities propagate using “Bayes Rule”. Despite the huge success of Bayesian Networks, Professor Pearl was not satisfied, so he invented causal networks and a causal language to go with them. This latter discovery has big implications for machine learning.

In what follows, I’ll assume you know Bayes Rule, and I give a flavor of what he accomplished, based on the book.

The weakness of associations in probability is that they don’t tell you what caused what. The rooster may crow just before sunrise, but the association doesn’t tell you whether the approaching sunrise caused the rooster to crow or the rooster’s crowing caused the sunrise.

Let’s start with some interesting aspects of Bayesian Graphs.


The simplest Bayes net would be a junction between two nodes that is updated via Bayes Rule. But let’s look at junctions that involve 3 nodes (which could exist in a huge graph of hundreds of nodes). The following is a quote from the book:

There are three basic types of junctions, with the help of which we can characterize any pattern of arrows in the network.
1. A -> B -> C. This junction is the simplest example of a “chain,” or of mediation. In science, one often thinks of B as the mechanism, or “mediator,” that transmits the effect of A to C. A familiar example is
Fire -> Smoke -> Alarm.
Although we call them “fire alarms,” they are really smoke alarms. The fire by itself does not set off an alarm, so there is no direct arrow from Fire to Alarm. Nor does the fire set off the alarm through any other variable, such as heat. It works only by releasing smoke molecules in the air. If we disable that link in the chain, for instance by sucking all the smoke molecules away with a fume hood, then there will be no alarm. This observation leads to an important conceptual point about chains: the mediator B “screens off” information about A from C, and vice versa. Suppose we had a database of all the instances when there was fire, when there was smoke, or when the alarm went off. If we looked at only the rows where Smoke = 1 (i.e. TRUE), we would expect Alarm = 1 every time, regardless of whether Fire = 0 (FALSE) or Fire = 1 (TRUE).
2. A <- B -> C. This kind of junction is called a “fork,” and B is often called a common cause or confounder of A and C. A confounder will make A and C statistically correlated even though there is no direct causal link between them. A good example (due to David Freedman) is Shoe Size <- Age of Child -> Reading Ability.
Children with larger shoes tend to read at a higher level. But the relationship is not one of cause and effect. Giving a child larger shoes won’t make him read better! Instead, both variables are explained by a third, which is the child’s age. Older children have larger shoes, and they also are more advanced readers. We can eliminate this spurious correlation, as Karl Pearson and George Udny Yule called it, by conditioning on the child’s age. For instance, if we look only at seven-year-olds, we expect to see no relationship between shoe size and reading ability.
3. A -> B <- C. An example: Talent -> Celebrity <- Beauty. Here we are asserting that both talent and beauty contribute to an actor’s success, but beauty and talent are completely unrelated to one another in the general population. We will now see that this collider pattern works in exactly the opposite way from chains or forks when we condition on the variable in the middle. If A and C are independent to begin with, conditioning on B will make them dependent. For example, if we look only at famous actors (in other words, we observe the variable Celebrity = 1), we will see a negative correlation between talent and beauty: finding out that a celebrity is unattractive increases our belief that he or she is talented. This negative correlation is sometimes called collider bias or the “explain-away” effect. For simplicity, suppose that you don’t need both talent and beauty to be a celebrity; one is sufficient. Then if Celebrity A is a particularly good actor, that “explains away” his success, and he doesn’t need to be any more beautiful than the average person. On the other hand, if Celebrity B is a really bad actor, then the only way to explain his success is his good looks. So, given the outcome Celebrity = 1, talent and beauty are inversely related, even though they are not related in the population as a whole. Even in a more realistic situation, where success is a complicated function of beauty and talent, the explain-away effect will still be present.

The miracle of Bayesian networks lies in the fact that the three kinds of junctions we are now describing in isolation are sufficient for reading off all the independencies implied by a Bayesian network, regardless of how complicated.
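
To make the three junctions concrete, here is a small simulation sketch in Python. Every probability and noise level in it is a number I made up for illustration (none of them come from the book), and the variable names are my own. It checks the three claims in the quote: conditioning on Smoke screens Fire off from Alarm, conditioning on age removes the shoe-size/reading correlation, and conditioning on Celebrity creates a negative correlation between Talent and Beauty.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
N = 20000


def corr(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def rate(flags):
    """Fraction of True values in a list of booleans."""
    return sum(flags) / len(flags)


# 1. Chain: Fire -> Smoke -> Alarm (invented probabilities).
chain = []
for _ in range(N):
    fire = rng.random() < 0.3
    smoke = rng.random() < (0.9 if fire else 0.05)
    alarm = rng.random() < (0.95 if smoke else 0.01)
    chain.append((fire, smoke, alarm))
# Among rows with Smoke = 1, the alarm rate should not depend on Fire.
alarm_given_smoke_fire = rate([a for f, s, a in chain if s and f])
alarm_given_smoke_nofire = rate([a for f, s, a in chain if s and not f])

# 2. Fork: Shoe Size <- Age -> Reading Ability (invented noise levels).
ages = [rng.randint(5, 10) for _ in range(N)]
shoe = [a + rng.gauss(0, 1) for a in ages]
reading = [a + rng.gauss(0, 1) for a in ages]
fork_all = corr(shoe, reading)
sevens = [i for i, a in enumerate(ages) if a == 7]
fork_age7 = corr([shoe[i] for i in sevens], [reading[i] for i in sevens])

# 3. Collider: Talent -> Celebrity <- Beauty, where one trait is sufficient.
talent = [int(rng.random() < 0.5) for _ in range(N)]
beauty = [int(rng.random() < 0.5) for _ in range(N)]
famous = [i for i in range(N) if talent[i] or beauty[i]]
collider_all = corr(talent, beauty)
collider_famous = corr([talent[i] for i in famous], [beauty[i] for i in famous])

print("chain   :", alarm_given_smoke_fire, alarm_given_smoke_nofire)
print("fork    :", fork_all, fork_age7)
print("collider:", collider_all, collider_famous)
```

In this toy setup the fork correlation should be large overall and near zero among seven-year-olds, while the collider correlation should be near zero overall and clearly negative among the famous.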

Pearl then explains that Bayes Rule lets you update your belief in a hypothesis when new data is presented. It lets you calculate a backward probability, given a forward probability:

Suppose you take a medical test to see if you have a disease, and it comes back positive. How likely is it that you have the disease? For specificity, let’s say the disease is breast cancer, and the test is a mammogram. In this example the forward probability is the probability of a positive test, given that you have the disease: P(test | disease). This is what a doctor would call the “sensitivity” of the test, or its ability to correctly detect an illness. Generally it is the same for all types of patients, because it depends only on the technical capability of the testing instrument to detect the abnormalities associated with the disease. The inverse probability is the one you surely care more about: What is the probability that I have the disease, given that the test came out positive? This is P(disease | test), and it represents a flow of information in the non-causal direction, from the result of the test to the probability of disease. This probability is not necessarily the same for all types of patients; we would certainly view the positive test with more alarm in a patient with a family history of the disease than in one with no such history. Notice that we have started to talk about causal and non-causal directions.

For the next quote, you should understand that we can rewrite Bayes’s rule as follows: (Updated probability of Disease once the test results are in) = P(D | T) = (likelihood ratio) × (prior probability of D) where the new term “likelihood ratio” is given by P(T | D)/P(T).
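
As a quick numerical check of this form of Bayes’s rule, here is a minimal Python sketch. The prior, sensitivity and false-positive rate are invented round numbers, not the figures for mammograms, and the function name is my own.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return (P(D | T), likelihood ratio) for a positive test result."""
    # P(T): total probability of a positive test, with and without the disease.
    p_test = sensitivity * prior + false_positive_rate * (1 - prior)
    # Likelihood ratio as defined in the text: P(T | D) / P(T).
    likelihood_ratio = sensitivity / p_test
    return likelihood_ratio * prior, likelihood_ratio


# Illustrative assumptions: 1% prior, 90% sensitivity, 10% false positives.
post, lr = posterior(prior=0.01, sensitivity=0.9, false_positive_rate=0.1)
print(f"likelihood ratio = {lr:.2f}, P(D | T) = {post:.3f}")
```

With these assumed numbers the likelihood ratio is about 8.3, so a positive test raises the probability of disease from 1% to roughly 8%, which is still far from certainty.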

Judea Pearl was reading about neural network models of the brain, and he put that together with Bayes Rule when he first planned how Bayesian Networks would work.

I assumed that the network would be hierarchical, with arrows pointing from higher neurons to lower ones, or from “parent nodes” to “child nodes.” Each node would send a message to all its neighbors (both above and below in the hierarchy) about its current degree of belief about the variable it tracked (e.g., “I’m two-thirds certain that this letter is an R”). The recipient would process the message in two different ways, depending on its direction. If the message went from parent to child, the child would update its beliefs using conditional probabilities,… If the message went from child to parent, the parent would update its beliefs by multiplying them by a likelihood ratio, as in the mammogram example.

In other words, ‘forward probability’ (Test | Disease) would be passed down, backward probability (Disease | Test) would be passed up.
In image recognition of a word, the probability of a word being “Lion” might be increased by a message passed up to a parent that there is more evidence that the first letter is “L”, and in turn, the more evidence there is for “Lion”, the more probability is given to the message passed downward that the first letter is “L”.

So why didn’t Professor Pearl rest on his laurels? He states some limitations with Bayesian networks:

By design, in a Bayesian network, information flows in both directions, causal and diagnostic: smoke increases the likelihood of fire, and fire increases the likelihood of smoke. In fact, a Bayesian network can’t even tell what the “causal direction” is…

It is one thing to say, “Smoking causes cancer,” but another to say that my uncle Joe, who smoked a pack a day for thirty years, would have been alive had he not smoked. The difference is both obvious and profound: none of the people who, like Uncle Joe, smoked for thirty years and died can ever be observed in the alternate world where they did not smoke for thirty years. Responsibility and blame, regret and credit: these concepts are the currency of a causal mind. To make any sense of them, we must be able to compare what did happen with what would have happened under some alternative hypothesis.

A structural model, as he diagrams it, looks simple. Here is one:


This is based on a causal model that a doctor named John Snow in England used in 1854 when there was a cholera outbreak. Dr. Snow trudged around town and realized that people downstream from a water company were getting sick. The belief among ‘experts’ at the time was that some atmospheric ‘miasma’ caused cholera, and that idea is also incorporated in the diagram. ‘Poverty’ probably really did have an effect on both water purity and likelihood of cholera (as you can also see in the diagram).

So what’s so great about the diagram shown above?

First, structural causal models are a shortcut that works, and there aren’t any competitors around with that miraculous property. Second, they were modeled on Bayesian networks, which in turn were modeled on David Rumelhart’s description of message passing in the brain.

Professor Pearl has an interesting speculation at this point:

It is not too much of a stretch to think that 40,000 years ago, humans co-opted the machinery in their brain that already existed for pattern recognition and started to use it for causal reasoning…

[A.I. researchers] aimed to build robots that could communicate with humans about alternate scenarios, credit and blame, responsibility and regret. These are all counterfactual notions that AI researchers had to mechanize before they had the slightest chance of achieving what they call “strong AI”—humanlike intelligence.

Pearl and his students created a mathematical language of causality. For instance, they could represent “counterfactuals” in this language. A counterfactual is an alternative that was not taken. For instance, “if only I had not left my Facebook page open to that joke I made about my wife when she was still in the house”.

Pearl writes:

The case for causal models becomes even more compelling when we seek to answer counterfactual queries such as “What would have happened had we acted differently?” any query about the mechanism by which causes transmit their effects—the most prototypical “Why?” question—is actually a counterfactual question in disguise. Thus, if we ever want robots to answer “Why?” questions or even understand what they mean, we must equip them with a causal model and teach them how to answer counterfactual queries.

Belief propagation formally works in exactly the same way whether the arrows are non-causal or causal. So Bayes Nets and causal diagrams have similarities. However, causal models are assumptions – and that is where their extra power comes from.


The above “ladder of causation” illustration shows that without a causal model, statistics stay only on the bottom rung of the ladder.

Remember this diagram from above:


Pearl says of it:

John Snow’s painstaking detective work had showed two important things: (1) there is no arrow between Miasma and Water Company (the two are independent), and (2) there is an arrow between Water Company and Water Purity. Left unstated by Snow, but equally important, is a third assumption: (3) the absence of a direct arrow from Water Company to Cholera, which is fairly obvious to us today because we know the water companies were not delivering cholera to their customers by some alternate route. … Because there are no confounders of the relation between Water Company and Cholera, any observed association must be causal. Likewise, since the effect of Water Company on Cholera must go through Water Purity, we conclude (as did Snow) that the observed association between Water Purity and Cholera must also be causal. Snow stated his conclusion in no uncertain terms: if the Southwark and Vauxhall Company had moved its intake point upstream, more than 1,000 lives would have been saved. Few people took note of Snow’s conclusion at the time. He printed a pamphlet of the results at his own expense, and it sold a grand total of fifty-six copies.

I don’t have room to go into the causal language and elegant equations Pearl has in his book, but I’ll mention the ‘do’ operator – the idea of actively changing a cause, as opposed to just observing:

if we are interested in the effect of a drug (D) on lifespan (L), then our query might be written symbolically as: P(L | do(D)). In other words, what is the probability (P) that a typical patient would survive L years if made to take the drug? This question describes what epidemiologists would call an intervention or a treatment and corresponds to what we measure in a clinical trial. In many cases we may also wish to compare P(L | do(D)) with P(L | do(not-D)); the latter describes patients denied treatment, also called the “control” patients. The do-operator signifies that we are dealing with an intervention rather than a passive observation; classical statistics has nothing remotely similar to this operator. We must invoke an intervention operator do(D) to ensure that the observed change in Lifespan L is due to the drug itself and is not confounded with other factors that tend to shorten or lengthen life. If, instead of intervening, we let the patient himself decide whether to take the drug, those other factors might influence his decision, and lifespan differences between taking and not taking the drug would no longer be solely due to the drug. For example, suppose only those who were terminally ill took the drug. Such persons would surely differ from those who did not take the drug, and a comparison of the two groups would reflect differences in the severity of their disease rather than the effect of the drug.

… Note that P(L | D) may be totally different from P(L | do(D)). This difference between seeing and doing is fundamental …A world devoid of P(L | do(D)) and governed solely by P(L | D) would be a strange one indeed. For example, patients would avoid going to the doctor to reduce the probability of being seriously ill; cities would dismiss their firefighters to reduce the incidence of fire…

Pearl notes that we can say that X causes Y if P(Y | do(X)) > P(Y). So here we have a mathematical formula, in his new causal language, for causality!
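
The difference between seeing and doing can be shown with a toy simulation. This is only a sketch under assumptions I invented: 30% of patients are seriously ill, ill patients mostly choose to take the drug, and the drug adds 0.1 to everyone’s survival probability. Here L stands for the event “the patient survives”, and all function names are hypothetical.

```python
import random


def survives(ill, drug, rng):
    # Invented numbers: illness lowers survival, the drug raises it by 0.1.
    p = (0.2 if ill else 0.8) + (0.1 if drug else 0.0)
    return rng.random() < p


def run(n, rng, do_drug=None):
    """Simulate n patients; do_drug=None means patients choose for themselves."""
    records = []
    for _ in range(n):
        ill = rng.random() < 0.3
        if do_drug is None:
            # Observation: the seriously ill are far more likely to take the drug.
            drug = rng.random() < (0.9 if ill else 0.1)
        else:
            # Intervention do(D) or do(not-D): the choice is forced on everyone.
            drug = do_drug
        records.append((drug, survives(ill, drug, rng)))
    return records


def survival_rate(records, drug):
    outcomes = [s for d, s in records if d == drug]
    return sum(outcomes) / len(outcomes)


rng = random.Random(1)
n = 50000
observed = run(n, rng)
p_see_drug = survival_rate(observed, True)      # estimates P(L | D)
p_see_nodrug = survival_rate(observed, False)   # estimates P(L | not-D)
p_do_drug = survival_rate(run(n, rng, do_drug=True), True)       # P(L | do(D))
p_do_nodrug = survival_rate(run(n, rng, do_drug=False), False)   # P(L | do(not-D))

print("seeing:", p_see_drug, p_see_nodrug)
print("doing :", p_do_drug, p_do_nodrug)
```

Under these made-up numbers the exact values are P(L | D) ≈ 0.42 and P(L | not-D) ≈ 0.77, so passive observation makes the drug look harmful, while P(L | do(D)) = 0.72 and P(L | do(not-D)) = 0.62 show that it helps. The simulated rates should land close to those values.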

We could think up other applications for Pearl’s causal diagrams and language. Perhaps indexing of pages by search engines on the internet would benefit by first parsing the causal structure of the story in the document. Perhaps arguments, political or otherwise, could be analyzed to show their causal assumptions.

Pearl says that current convolutional nets and deep-learning nets leave out causality. I would mention though that there are also “generative models” that do learn causes. You can test hypotheses about neuronal time-series with a method called “dynamic causal modelling”, and in “free energy” based nets, which I mentioned in an earlier blog post, the application of the free energy principle results in a generative model that generates consequences from causes.

But apart from that minor point, I would think the application of Pearl’s causal language and diagrams will give a boost to neural models of all sorts as well as reasoning and logic. His causal diagrams also are easy for a human to understand, unlike the internal weights of recurrent generative nets.


Pearl, Judea. The Book of Why: The New Science of Cause and Effect. Basic Books. Kindle Edition.

Using word context in documents to predict brain representation of meaning.

In an article titled “Predicting Human Brain Activity Associated with the Meanings of Nouns”, Tom Mitchell of Carnegie Mellon University and associates describe how the meanings of words are mapped in the brain.
The first step was to get examples of words in documents. They obtained a trillion-word collection.

It has been found that a word’s meaning is captured to some extent by the distribution of words and phrases with which it commonly co-occurs. For instance, the word ‘soup’ will often occur in documents that also have the words ‘spoon’, ‘eat’ and ‘dinner’. It would not be as likely to co-occur with words like ‘quasar’ or ‘supernova’. They then used the statistics they had gathered as follows:
Given an arbitrary stimulus word w, they encoded the meaning of w as a vector of intermediate semantic features. For example, one intermediate semantic feature might be the frequency with which w co-occurs with the verb ‘hear.’ The word ‘ear’ in the trillion-word text corpus would presumably co-occur with ‘hear’ at a high frequency, while the word ‘toe’ would have a lower co-occurrence frequency.

They limited themselves to 25 semantic features, so they had to pick and choose among words to use and they ended up with 25 verbs:
“see,” “hear,” “listen,” “taste,” “smell,” “eat,” “touch,” “rub,” “lift,” “manipulate,” “run,” “push,” “fill,” “move,” “ride,” “say,” “fear,” “open,” “approach,” “near,” “enter,” “drive,” “wear,” “break,” and “clean.”
These verbs generally correspond to basic sensory and motor activities, actions performed on objects, and actions involving changes to spatial relationships.

For each verb, the value of the corresponding intermediate semantic feature for a given input stimulus word w (they used nouns for ‘w’) is the normalized co-occurrence count. For example, if the word ‘ear’ occurs near ‘hear’ 1000 times in the trillion-word corpus, but occurs in general 3000 times, then the normalized value would be 1/3. Then on top of that, the vector of features was normalized to unit length, which means that if you plotted the vector in the 25 dimensions, the length of the vector would be ‘1’.
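
The two normalization steps can be sketched in a few lines of Python. This follows the description above rather than the authors’ actual code. I use three verbs instead of 25 to keep it short, and apart from the 1000-out-of-3000 example, the counts are invented.

```python
import math


def semantic_vector(cooccurrence_counts, total_count):
    """Return (normalized counts, unit-length feature vector) for one noun."""
    # Step 1: divide each co-occurrence count by the noun's overall count.
    normalized = [c / total_count for c in cooccurrence_counts]
    # Step 2: rescale so the vector has Euclidean length 1.
    length = math.sqrt(sum(v * v for v in normalized))
    if length == 0:
        return normalized, normalized  # noun never co-occurs with any verb
    return normalized, [v / length for v in normalized]


verbs = ["hear", "listen", "eat"]   # stand-in for the 25 verbs
ear_counts = [1000, 500, 10]        # only the 1000 comes from the text
normalized, unit = semantic_vector(ear_counts, total_count=3000)

for verb, raw, scaled in zip(verbs, normalized, unit):
    print(f"{verb:>6}: normalized {raw:.3f}, unit-length {scaled:.3f}")
```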

The next step was to take a ‘functional MRI’ (fMRI) scan of the brains of volunteers being exposed to various nouns. For instance, the experimenters would obtain an fMRI image of the response to the word ‘ear’, which we’ll say for this example has a value of 1/3 for the semantic feature ‘hear’. They divided the fMRI image into a large number of tiny packed cubes called ‘voxels’ (analogous to the small square pixels in a 2-dimensional picture) and used regression to get a model that in this case associated the value (1/3) of the semantic feature ‘hear’ with every voxel. So if they had N voxels, they would now have N weights that the regression had found associating ‘hear’ with the voxels. They did this for all 25 features in the feature vector.

Once the model had learned how the feature vectors of a training set of words correlate with the voxels, they tried to predict what an fMRI image would look like for new nouns – words that they had not trained the model on. They found that the model predicted fMRI activity well enough to successfully match words it had not yet encountered, with accuracies far above those expected by chance.

In a prediction, every voxel is a weighted sum of the 25 features in the presented word. Recall that each feature is obtained by co-occurrence statistics.
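
Here is a minimal sketch of that train-and-predict loop in Python with NumPy. It is not the authors’ code: the ‘fMRI images’ are synthetic data I generate from a hidden linear rule plus noise, the sizes (60 words, 200 voxels) are arbitrary, and scoring a match by cosine similarity between predicted and observed images is my assumption about one reasonable way to do it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_features, n_voxels = 60, 25, 200

# Synthetic stand-ins for real data: unit-length feature vectors per word,
# and voxel activations produced by a hidden linear rule plus noise.
X = rng.random((n_words, n_features))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W_true = rng.normal(size=(n_features, n_voxels))
Y = X @ W_true + 0.1 * rng.normal(size=(n_words, n_voxels))

# Hold out the last two words; train on the rest.
X_train, Y_train = X[:-2], Y[:-2]
X_test, Y_test = X[-2:], Y[-2:]

# Least-squares regression: one weight per (feature, voxel) pair.
W, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

# Each predicted voxel is a weighted sum of the word's 25 features.
Y_pred = X_test @ W


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Do the two predictions match the right held-out images better than swapped?
correct = cosine(Y_pred[0], Y_test[0]) + cosine(Y_pred[1], Y_test[1])
swapped = cosine(Y_pred[0], Y_test[1]) + cosine(Y_pred[1], Y_test[0])
print("correct pairing wins:", correct > swapped)
```

Because the synthetic data really are linear in the features, the match is easy here; with real scans the linearity is exactly the assumption the authors flag as debatable.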

They caution that the assumption that brain activation is based on a weighted linear sum of contributions from each of its semantic features is debatable (the relation might not be linear) but they still got good results using that assumption. They also note that they are correlating features with an entire MRI, and then observing which areas seem to especially light up for a given meaning.

What would the vector for the word ‘hear’ look like? One of the features in that vector actually is the word ‘hear’, so its value would be one, and all the other values  in the vector would be zero.

You can present this word to the model, and see where it predicts the MRI would be activated.

They found that:

the learned fMRI signature for the semantic feature “eat” predicts strong activation in opercular cortex, which others have suggested is a component of gustatory cortex involved in the sense of taste. Also, the learned fMRI signature for “push” predicts substantial activation in the right postcentral gyrus, which is widely assumed to be involved in the planning of complex, coordinated movements. Furthermore, the learned signature for “run” predicts strong activation in the posterior portion of the right superior temporal lobe along the sulcus, which others have suggested is involved in perception of biological motion.
To summarize, these learned signatures cause the model to predict that the neural activity representing a noun will exhibit activity in gustatory cortex to the degree that this noun co-occurs with the verb “eat,” in motor areas to the degree that it co-occurs with “push,” and in cortical regions related to body motion to the degree that it co-occurs with “run.”

They also find that different people have commonalities in how they represent meanings, though it would be interesting to see if there were differences in very abstract words – for instance, does a group of Marxists represent ‘justice’ differently than a group of conservative economists?

They did find some differences even with concrete words:

in some cases the correspondence holds for only a subset of the nine participants. For example, additional features for some verbs for participant P1 include the signature for “touch,” which predicts strong activation in somatosensory cortex (right postcentral gyrus), and the signature for “listen,” which predicts activation in language-processing regions (left posterior superior temporal sulcus and left pars triangularis), though these trends are not common to all nine participants.

Each of the 25 features can be thought of as a dimension, just like an X/Y axis is used for 2 dimensions.


On each of the 25 axes is a value between 0 and 1, and together, they make a vector that ends at a point of distance 1 from the origin. That point is the location of a word in ‘meaning space’. In linear algebra, we know that ideal x and y axes are perpendicular (90 degrees and totally independent from each other), but even if they are not, as long as the angle between them is not zero, you can generate a 2D space from combinations of numbers on each axis that contains all the vectors as if the bases had been independent. In any event, Mitchell’s group did test some other choices of features, but the predictions of MRI scans using those features were not as good as the set mentioned above.

The next image shows part of a vector for ‘celery’ and for ‘airplane’


A co-occurrence method is not the only way to represent a meaning. A method used by Gustavo Sudre, who worked with Tom Mitchell, was using context (verb plus object for instance in a verb phrase) and parts of speech.


Another way to represent meaning is like the game ‘20 questions’. You ask the same questions about a set of words; some will have an answer of 1 (yes) and some will have an answer of 0 (no). So you can get a vector that captures some aspects of meaning that way as well.
When trying to decode a noun using 218 questions, it was found that some questions were better than others. It was also found that they often fell into 3 categories: size, manipulability, and animacy (alive or not).

This next image is interesting – Gustavo Sudre found that when a noun is represented in the brain, different semantic aspects peak at different times. Different parts of the brain would handle groups of semantic aspects, for instance, a motor area might handle the ‘move’ aspect of the word ‘push’. This raises a question of how all the relevant parts of the word are remembered and used – since they don’t peak at the same instant. One idea – at least as far as sentences, is that different aspects of meaning are part of a cyclical oscillation. To me that’s not a sufficient explanation. The different parts of meaning have to interact, and create a whole, in some way. But here is the image:
Mitchell’s student Leila Wehbe has also done work with entire sentences and fMRI, but that will be a subject for another post.
Sources: (Tom Mitchell video)

Finding invariant object representations by using the thalamus as a blackboard.

Randall C. O’Reilly, Dean R. Wyatte, and John Rohrlich at the Department of Psychology and Neuroscience of the University of Colorado Boulder have come up with a new model of how the brain learns. As they put it in their article titled “Deep Predictive Learning”:

where does our knowledge come from? Phenomenologically, it appears to magically emerge after several months of [an infant] … gaping at the world passing by – what is the magic recipe for extracting high-level knowledge from an ongoing stream of perceptual experience?

Their model has several basic elements.
There is an ‘error’ signal that reflects the discrepancy between what the brain expects to happen and what is actually experienced.
That idea is not new; in fact, there are many hierarchical models (known as Hierarchical Generative Models) where ‘predictions’ flow down the hierarchy while sensory data flows up. (The higher level of the hierarchy generally represents more abstract features of the input, so if a low level represents edges between dark and light, a higher level might represent a shape such as a cylinder.) The two flows subtract at a group of error neurons at each level. For example, if level 1 is the lowest level, it might receive sensory input from the eyes, but it also receives a prediction from the next higher level, level 2. The two signals subtract at a specialized group of neurons in level 1 called ‘error’ neurons, and these then pass the error signal (the mismatch between predictions and actual sensory input) up to level 2. (In this model level 2 also gets the processed sensory information passed up from level 1.) Level 2 corrects its connections based on a learning algorithm, using the error information. By this process level 2 learns to represent the layer below it (level 1).

The same thing happens between level 2 and level 3: here level 2 passes up abstracted sensory information, level 3 passes down its predictions, and error neurons on level 2 compute the mismatch between the two. The error neurons then pass up the error to level 3 for level 3 to use in refining its predictions for next time. Eventually level 3 learns, in an abstract way, to represent level 2.

It is typical to train a model like this from the bottom up, so first layer 2 learns to represent layer 1, then layer 3 learns to represent layer 2, etc.
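
As a very reduced sketch of one step in such a hierarchy, the Python below lets a ‘level 2’ learn top-down weights that predict ‘level 1’ activity, using the mismatch computed by error units. This is a plain delta rule under strong simplifying assumptions of mine: the level-2 state for each input is given (a one-hot code) rather than inferred, and the patterns, learning rate and function names are all invented. It is not any particular published model.

```python
def predict(W, code):
    """Top-down prediction of level-1 activity from a level-2 code."""
    return [sum(w * c for w, c in zip(row, code)) for row in W]


def train_top_down(patterns, codes, lr=0.2, epochs=200):
    n_in, n_hid = len(patterns[0]), len(codes[0])
    W = [[0.0] * n_hid for _ in range(n_in)]
    for _ in range(epochs):
        for x, code in zip(patterns, codes):
            pred = predict(W, code)
            # 'Error neurons': sensory input minus top-down prediction.
            err = [xi - pi for xi, pi in zip(x, pred)]
            # Level 2 adjusts its connections to shrink that error next time.
            for i in range(n_in):
                for j in range(n_hid):
                    W[i][j] += lr * err[i] * code[j]
    return W


def total_error(W, patterns, codes):
    return sum(
        abs(xi - pi)
        for x, code in zip(patterns, codes)
        for xi, pi in zip(x, predict(W, code))
    )


# Two invented level-1 activity patterns and the level-2 codes that stand for them.
patterns = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.5, 1.0]]
codes = [[1.0, 0.0], [0.0, 1.0]]

untrained = [[0.0, 0.0] for _ in range(4)]
error_before = total_error(untrained, patterns, codes)
W = train_top_down(patterns, codes)
error_after = total_error(W, patterns, codes)
print("error before:", error_before, "error after:", error_after)
```

After training, the prediction error is essentially zero, which is the sense in which the higher level has learned to represent the level below it.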

The Boulder group see things differently, however. In their model, the various successive levels in the vision hierarchy (V1, V2, V3, etc.) all pass sensory information (or abstractions of sensory information) to an area of the thalamus called the ‘pulvinar’. The pulvinar is like a blackboard in a classroom, where each student has reserved his own small part of the blackboard to make a note or draw a picture. Each student can see the entire blackboard but can only modify his own part of it. The analogy to a classroom doesn’t work too well though, because in their model the blackboard is showing error – or mismatch – between sensory information and predictions of what that information should be. For instance, one section of the blackboard might be comparing the abstracted sensory information from level V2 with the prediction from level V2.

An interesting aspect of this model is that a higher level in the visual hierarchy gets error information from all the levels below it, not just from the layer immediately below it. It sends both predictions and actual information to just one part of the pulvinar but gets back error signals from all parts of the pulvinar.
In this model, the highest level abstract information is also sent via the pulvinar to the lowest layer.
The comparison between sensory signals and predicted signals uses a time difference. The past is used to predict the future. For instance, in a 100-millisecond interval, the first 75 milliseconds are used to send a prediction signal to the pulvinar neurons, and then the next 25 milliseconds are used to send the current (abstracted) sensory information to those same neurons. The first 75 milliseconds alter the properties of those neurons – perhaps their thresholds adapt – and the input of the next 25 milliseconds is decremented by this new threshold. That neural firing, which is the difference between the predicted and the sensory signals, is then fed back from the pulvinar toward the various visual levels. The first 75-millisecond prediction signal is called the ‘minus’ phase, and the second ‘ground truth’ signal is called the ‘plus’ phase. (Minus comes before plus: predictions come before the ‘ground truth’ signals that they are then compared with.) The model thus explains why the brain has an ‘alpha’ rhythm, which repeats 10 times a second (so each cycle is 100 milliseconds long). Both the deep layer and the pulvinar show this rhythm. The superficial layers do not show the rhythm.
In an infant, some areas of the brain work before others.
The infant may see a blob moving, without the details, but this is enough to provide an error mismatch as to ‘where’ something is, and the ‘where’ error corrects the spatial movement predictions to the point that now the brain can concentrate on the ‘what’. The authors write:

Unlike the formation of invariant object identity abstractions (in the What pathway), spatial location (in retinotopic coordinates at least) can be trivially abstracted by simply aggregating across different feature detectors at a given retinotopic location, resulting in an undifferentiated spatial blob. These spatial blob representations can drive high level, central spatial pathways that can learn to predict where a given blob will move next, based on prior history, basic visual motion filters, and efferent copy inputs…

This is a division of labor, where one problem must be solved before another is addressed. (blogger aside – perhaps this has implications for backpropagation nets in general – you could have a few hidden nodes involved in learning one aspect of a scene, then freeze their weights so they don’t change, and then add more hidden nodes to learn other aspects of the scene).

The authors also believe that the need to address problems in order explains a finding – that the lowest layer’s error signal goes, not just to the next higher level, but also to the highest levels. They write:

We have consistently found that our model depends critically on all areas at all levels receiving the main predictive error signal generated by the V1 layer 5IB driver inputs to the pulvinar in the plus phase. This was initially quite surprising at a computational level, as it goes strongly against the classic hierarchical organization of visual processing, where higher areas form representations on top of foundations built in lower areas — how can high-level abstractions be shaped by this very lowest level of error signals? We now understand that the overall learning dynamic in our model is analogous to multiple regression, where each pathway in the model learns by absorbing a component of the overall error signal, such that the residual is then available to drive learning in another pathway. Thus, each factor in this regression benefits by directly receiving the overall error signal…

(In multiple regression, the main component of a plotted curve is isolated first, and then secondary and then tertiary components)
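
To make the regression analogy concrete, here is a minimal toy sketch (my own illustration, not the authors' model). Two hypothetical linear "pathways", named `where` and `what` only for this example, each see a different input but both receive the same overall error signal. Each one absorbs the part of the error that correlates with its own input, and whatever remains is left to drive the other. The data, learning rate, and target coefficients are all made-up assumptions.

```python
import numpy as np

def train_pathways(x_where, x_what, target, lr=0.05, epochs=500):
    """Two linear 'pathways' learn from one shared error signal (delta rule)."""
    w_where, w_what = 0.0, 0.0
    for _ in range(epochs):
        prediction = w_where * x_where + w_what * x_what
        error = target - prediction  # the single, overall error signal
        # Each pathway absorbs the component of the error that
        # correlates with its own input; the rest stays as residual.
        w_where += lr * np.mean(error * x_where)
        w_what += lr * np.mean(error * x_what)
    return w_where, w_what

rng = np.random.default_rng(0)
x_where = rng.normal(size=200)
x_what = rng.normal(size=200)
# Hypothetical signal: a large 'where' component and a smaller 'what' component.
target = 2.0 * x_where + 0.5 * x_what

w_where, w_what = train_pathways(x_where, x_what, target)
print(w_where, w_what)
```

Both weights end up close to the coefficients used to build the target, even though neither pathway ever receives an error signal of its own.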

So how does the mechanism work? The cortex has six layers. Layer 1 is closest to the skull, layer 6 is deepest. So, we call layers 2, 3, and 4 ‘superficial’, and layers 5 and 6 ‘deep’.
In the deep predictive model the superficial layers reflect the current state of the environment while the deep layers generate predictions. A region such as V1, which is the lowest in the visual hierarchy, has six layers, and a region such as V3, which is higher, has the same organization of six layers. In the picture below, you can see how a lower level (the bottom square with the 3 colors) interacts with a higher level (the top square). In both squares, blue stands for ‘superficial layers’ and pink stands for ‘deep layers’. The green layer is also considered part of the superficial layers, but it is shown separately because it receives inputs.
You can see that the feedforward signals go from the superficial layers (of the lower layer) to layer 4 in the upper level. But there are also feedback connections. The superficial layers of the top level go down and feed into both superficial and deep layers in the lower level. The deep layers of the top level also send feedback to these two areas. But most feedforward projections originate from superficial layers of lower areas and deep layers predominantly contribute to feedback.


For the deep levels to generate predictions, they must temporarily not be updated by outside information. If you are playing the game “pin the tail on the donkey”, you wear a blindfold so that you won’t see where the donkey is pasted to the wall. You can’t make an honest prediction if you peek under the blindfold.
The authors put it like this:

Well-established patterns of neocortical connectivity combine with phasic burst firing properties of a subset of deep-layer neurons to effectively shield the deep layers from direct knowledge of the current state, creating the opportunity to generate a prediction. Metaphorically, the deep layers of the model are briefly closing their “eyes” so that they can have the challenge of predicting what they will “see” next. This phasic disconnection from the current state is essential for predictive learning (even though it willfully hides information from a large population of neurons, which may seem counter-intuitive) …

Suppose we just focus on region V2 and the region below it, V1. The deep layer of V2, layer 6, which has been shielded for 75 milliseconds from any real-world corrections, sends its predictions to the pulvinar. Shortly after, for a duration of 25 milliseconds, V1’s own deep layer 5 (not 6) sends actual real values to the pulvinar via 5IB neurons. At the pulvinar the two projections impinge on error neurons which calculate a mismatch and send that mismatch back via projections not only to V2, but also to higher layers, which then use an algorithm (equivalent to backpropagation) to learn from the error.
(It was long thought that backpropagation was not realistic in the brain, but if you consider that many connections in the cortex are bidirectional, you can see that the ‘error gradients’ of backpropagation do have a pathway to pass backwards, the opposite direction of feedforward inputs. The Colorado group has an online book where they go into detail on this (
Recall that V2’s deep layer has been shielded from the real world for 75 milliseconds. Now it is time to briefly ‘unshield’ it, and that is done by having V2’s superficial layer update the content of the V2 deep layer.
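
The minus/plus cycle described above can be sketched in a few lines. This is a heavily simplified toy, not the authors' simulation: a single weight matrix stands in for the deep layer, a plain subtraction stands in for the pulvinar, and a delta rule stands in for their learning algorithm. The function name `alpha_cycle`, the learning rate, and the four-state toy "world" are my own assumptions.

```python
import numpy as np

MINUS_MS, PLUS_MS = 75, 25          # phase durations given in the text
CYCLE_MS = MINUS_MS + PLUS_MS       # 100 ms, i.e. a 10 Hz alpha rhythm

def alpha_cycle(weights, previous_state, current_input, lr=0.1):
    """One cycle: predict (minus phase), compare (plus phase), then learn."""
    prediction = weights @ previous_state    # minus phase: shielded deep layer predicts
    error = current_input - prediction       # plus phase: mismatch at the 'pulvinar'
    weights = weights + lr * np.outer(error, previous_state)  # delta-rule update
    return weights, error

# Toy world: four one-hot states that follow each other in a fixed loop.
states = np.eye(4)
weights = np.zeros((4, 4))
errors = []
for t in range(400):
    prev, cur = states[t % 4], states[(t + 1) % 4]
    weights, err = alpha_cycle(weights, prev, cur)
    errors.append(np.linalg.norm(err))

print(errors[0], errors[-1])
```

The prediction error starts large and shrinks toward zero as the weights come to capture which state follows which.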

In the next image, the ‘prediction’ or ‘minus’ phase is happening first. If you start at the middle row of boxes (which represent the pulvinar), it sends an error that is obtained from comparing its minus and plus phase to both the superficial and deep layers (in this case of V2). At the last 25 milliseconds, the 5IB neurons (diagonal line going down to the right from the superficial layers of V2) update the deep layer of V2. These same 5IB cells drive a plus phase in a higher area of the pulvinar (in this diagram, you can see at the bottom, V1 5IB neurons sending plus phase information to the area of the pulvinar that will in turn send error information to V2)

The authors believe that during the next 75 milliseconds, the deep layer is remembering its inputs by a reverberating signal. The superficial layers, which are not shielded from the environment, are attempting to settle down as a ‘Hopfield Network’ might – trying to satisfy all the conflicting constraints of information that arrives from the senses as well as the information that arrives from higher level superficial layers. An example of this behavior by the superficial layers is when you look at the ‘Dalmatian illusion’ – it’s hard to figure out what the picture shows unless someone tells you it’s a Dalmatian, in which case top-down information combines with the information you see to settle on the correct interpretation.


We just said that the deep layer has to be isolated from the outside world, part of the time. In the diagram with the 3 colors above, we see that the ‘superficial’ layer of the low level, which represents ‘current world’ information, does not feed into the ‘deep’ layer of the level above it. This makes sense, because the ‘deep’ level is making the prediction, and we don’t want that ‘prediction’ to be short-circuited by real-world information.
It might seem contradictory though that there is a feedback from the superficial layer of the higher level to the deep layer of the lower level. After all, that superficial layer is also carrying ‘current real world’ information. The authors explain that this layer is more abstract, so it doesn’t carry real-world details that have to be predicted at the lower level.

There is a paradox in this model:
To predict that in the next instant, a cat will move its paw toward a ball of string, you need to feed high level information about the shape of a cat and the properties of a cat. If you are going to learn at the lowest level, V1, what low level features to expect next when seeing part of the visual field that has a cat in it, it would help to already know what a cat was. But on the other hand, how can we develop the abstract generalization of “cat” when we don’t yet know anything about fur, paws, teeth, etc?
As they put it:

How can high-level abstract representations develop prior to the lower-level representations that they build upon?
In other words, this is a case where “it takes the whole network to raise a model”—the entire predictive learning problem must be solved with a complete, interacting network, and cannot be solved piece-wise.

The next two diagrams are from their paper. The first shows various levels divided into superficial (blue) and deep (pink). If you look at ‘V1’ for example, you see that its superficial layer feeds into the superficial layer of V2. The main thing to notice is that deep layers don’t receive input from superficial layers of levels that are lower than the level they are part of. Also, the superficial layer of TEO, a level at the top of the hierarchy, sends feedback to a level near the bottom – the deep layers of level V2, and V2’s superficial layers send a signal directly to a level at the top (MT). TEO’s deep layer sends a feedback signal to V2’s superficial layer as well. The original caption for this figure states: “Special patterns of connectivity from TEO to V3 and V2, involving crossed super-to-deep and deep-to-super pathways, provide top-down support for predictions based on high-level object representations (particularly important for novel test items).”


The second diagram shows only the deep section of each level (purple) next to a corresponding section of the pulvinar that it sends projections to. As you can see, the pulvinar that receives V1 information sends out error information via projections to many other areas. The original caption for the figure states: “Most areas send deep-layer prediction inputs into the main V1p prediction layer, and receive reciprocal error signals therefrom. The strongest constraint we found was that pulvinar outputs (colored green) must generally project only to higher areas, not to lower areas, with the exceptions of MTp –> V3 and LIPp –> V2”


The exciting aspect of this model is that without a teacher, it learns to recognize objects, irrespective of movement.
The authors are now looking at modeling 3D objects and also plan to combine auditory information with the visual model they have constructed so far.

Deep Predictive Learning: A Comprehensive Model of Three Visual Streams
Randall C. O’Reilly, Dean R. Wyatte, and John Rohrlich Department of Psychology and Neuroscience University of Colorado Boulder, 345 UCB, Boulder, CO 80309
e-cortex is a company that Randall O’Reilly works at – eCortex research has been commissioned by some of the biggest government and commercial entities with interest in applied brain-based artificial intelligence

Note: to be more accurate, the deep layer represents 100 milliseconds of information, not just 75, because its processing overlaps with the 25 milliseconds of the ‘ground truth’ signal that is sent to the pulvinar.


Creating Neural Nets based on a Free-Energy principle

In 1982 John Hopfield invented a neural net that minimized a set of constraints. If two nodes of the net were meant to go together, they would start off with a positive weight between them. If two nodes were incompatible, they would start with a negative weight. Sometimes this would lead to conflict – for instance, if nodes A and B were supposed to go together, and B and C were also supposed to go together, but A and C were supposed to be incompatible. All nodes are connected to each other, and send a signal to all their neighbors.

An example of such a conflict is if node A signals the desire to have cake, and node C signals the desire to be thin, and node B signals the desire to attend a big party. You can create a large net with conflicting constraints (fixed weights) and let the activations evolve, using Hopfield’s learning algorithm, until the pattern of activations settles in a minimum where conflict is minimized, and harmony is maximized. A conflict in such a net would be where two nodes are connected by reciprocal positive weights, but one node has an activation that is negative, and the other has an activation that is positive. So each node is telling the other that it is ‘wrong’.

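Hopfield’s algorithm minimizes a measure he calls ‘Energy’. In its standard form (leaving out per-node thresholds), the equation for this is:

$$E = -\sum_{i<j} w_{ij}\, s_i\, s_j$$

Here s_i is the activation of node i, w_ij is the weight between nodes i and j, and the sum counts each pair of nodes once.
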
The goal is to minimize this energy. There is a measure called “Harmony” which is the above equation without the negative sign so minimizing energy is maximizing harmony. For convenience I’ll use harmony to make this point: when nodes X and Y are both ‘1’ (they can be either 1 or -1) and the weight between them is 1, then the harmony is 1 * 1 * 1, which is the highest it could be for two nodes. So if two values have the same sign, and the weight between them is positive, then they reinforce each other. If X was -1 and Y was 1, and the weight between them was -1, then that also results in a large harmony (-1 * 1 * -1 ==> 1), and that is desired, since values with different signs are incompatible, and the weight between them should reflect that.
In reality there are more than two nodes in most Hopfield nets, and every node is connected to every other node.
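
As a small sketch of the arithmetic above, the following code computes harmony and energy for the two-node cases in the text, and then for the three-node cake/party/thin conflict. The function names, the `settle` routine, and the choice of weights of exactly +1 and -1 are my own assumptions for illustration; the update rule (each node takes the sign of its summed input) is the usual asynchronous Hopfield update.

```python
import numpy as np

def harmony(weights, state):
    """Sum of w_ij * s_i * s_j, counting each pair of nodes once.
    Assumes symmetric weights with a zero diagonal."""
    return 0.5 * state @ weights @ state

def energy(weights, state):
    return -harmony(weights, state)

def settle(weights, state, sweeps=10):
    """Asynchronous updates: each node takes the sign of its net input."""
    state = state.copy()
    for _ in range(sweeps):
        for i in range(len(state)):
            net = weights[i] @ state
            state[i] = 1.0 if net >= 0 else -1.0
    return state

# Two nodes that agree, joined by a positive weight.
agree = harmony(np.array([[0.0, 1.0], [1.0, 0.0]]), np.array([1.0, 1.0]))
# Two nodes that disagree, joined by a negative weight.
oppose = harmony(np.array([[0.0, -1.0], [-1.0, 0.0]]), np.array([-1.0, 1.0]))

# Three nodes: A = cake, B = party, C = thin (hypothetical weights).
w3 = np.array([[0.0, 1.0, -1.0],    # A goes with B, conflicts with C
               [1.0, 0.0, 1.0],     # B goes with A and with C
               [-1.0, 1.0, 0.0]])   # C conflicts with A, goes with B
settled = settle(w3, np.array([1.0, 1.0, 1.0]))
print(agree, oppose, energy(w3, settled))
```

In the three-node case no assignment satisfies all three constraints at once, so the best the net can do is an energy of -1 rather than -3: one constraint always stays violated.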

It turns out that the “Harmony” function has close relationships to a quantity used in modern day machine learning known variously as an evidence lower bound (ELBO) or variational free energy. In other words, finding the right sort of solution – that satisfies multiple constraints – can be thought of as minimizing free energy. So what is free energy? In thermodynamics, free energy is the energy available to do work, where unavailable energy is known as entropy. This means maximizing Harmony, or minimizing free energy, reduces the energy or tension in a system, while maintaining a high entropy. So what is entropy?

In the physical world, entropy is a measure of disorder – for instance, if you have a group of balls confined in a metal triangle in a pool table, and you remove the triangle and hit the balls with the cue stick, then they can take on many configurations, as they scatter across the table. It would be difficult to hit the balls again and make them organize into the original triangle. In theory it is possible, but the probability would be low. It’s much easier to go the other way.

So you might wonder whether ‘energy’ and ‘entropy’ as defined in the Hopfield net signify anything more than an analogy, though they do use similar math to thermodynamic free energy.   There is even more to the analogy – Hopfield nets traverse energy landscapes as they resolve the conflicts between nodes; very much like the physical force of ‘gravity’ is manifest in a landscape of basins in space-time created by massive planets, for example.

I asked Professor Karl Friston (a professor at the Institute of Neurology, University College London), who has created nets based on a theory of “free energy”, what energy and entropy mean. I said that my intuitive idea of energy was that the amount of conflicting constraints is energy, and when all constraints are satisfied, you have an energy minimum. He replied:

…free energy comes in many guises. It is also called an evidence lower bound (ELBO), bound on integrated or marginal likelihood and so on. The reason it is called free energy is that it can always be decomposed into an energy and entropy term. These can be rearranged into a term that corresponds to the evidence (the probability of data under a particular model) and a KL (Kullback–Leibler) divergence that is a non-negative bound. In a sense, your intuition about energy is correct. In the variational free energy, the energy is the negative log probability of a particular configuration of causes and consequences. Finding the most likely configuration is equivalent to minimizing the energy – much as in statistical thermodynamics.

The entropy means what it says. It is a Shannon entropy of approximate posterior beliefs. This is closely related to Occam’s principle.
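
The decomposition Professor Friston describes can be checked numerically on a tiny example. The sketch below uses a made-up model with one binary hidden cause and one observed datum (all probabilities are hypothetical). It computes free energy as expected energy minus entropy, and confirms that the same number equals a KL divergence minus the log evidence, which is why free energy can never fall below the negative log evidence.

```python
import numpy as np

# Hypothetical toy model: hidden cause z in {0, 1}, one observed datum x.
prior = np.array([0.7, 0.3])         # p(z)
likelihood = np.array([0.2, 0.9])    # p(x | z) for the datum we observed
joint = prior * likelihood           # p(x, z)
evidence = joint.sum()               # p(x)
posterior = joint / evidence         # p(z | x)

def free_energy(q):
    """Energy term minus entropy term, for approximate posterior beliefs q."""
    expected_energy = -(q * np.log(joint)).sum()  # energy = negative log probability
    entropy = -(q * np.log(q)).sum()              # Shannon entropy of q
    return expected_energy - entropy

def kl(q, p):
    """Kullback-Leibler divergence from p to q (non-negative)."""
    return (q * np.log(q / p)).sum()

q = np.array([0.5, 0.5])             # an arbitrary, imperfect belief
F = free_energy(q)
print(F, kl(q, posterior) - np.log(evidence), -np.log(evidence))
```

When `q` is set to the true posterior the KL term vanishes and the free energy equals the negative log evidence exactly.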

The remark comparing entropy to Occam’s principle is interesting. Occam’s principle is the problem-solving principle that, when presented with competing hypothetical answers to a problem, one should select the one that makes the fewest assumptions. In other words, if you can explain something without committing to a highly implausible configuration of causes (e.g., neatly arranged billiard balls and a nice triangle), then it is – provably – a better explanation (in the sense of universal computation and Bayesian or abductive inference).

An article on the Hopfield net defined Hopfield’s energy as: “a measure of constraint satisfaction–when all the constraints are satisfied, the energy is lowest. Energy represents the cumulative tension between competing and cooperating constraints.”

The nets that Professor Friston develops are organized as hierarchies. Sensory information comes in at the lowest part of the tree (or hierarchy) and learned predictive information comes down the tree. The two streams meet, and if they don’t match, then the mismatch or error information propagates upward, to change the weights of the net so that it produces better predictions in future. The algorithm tries to minimize free energy, which also means that it tries to minimize surprise – or more simply the mismatch between the data at hand and predictions of those data (i.e. prediction error). The mismatch is calculated at every level.

Karl’s nets are recurrent – signals don’t just go up the hierarchy, they also go down the hierarchy. So a low level node affects upper level nodes, whose changes then affect the lower level nodes. So mismatches and bad predictions get minimized. The same idea applies to backpropagation in neural networks – a mismatch between desired result and actual result is minimized. Interestingly, exactly the same sort of architecture is found in machine learning in the form of bottleneck architectures (e.g., variational autoencoders), where (ascending) signals are forced through a ‘complexity minimizing’ bottleneck to fan out again to provide (descending) signals that predict the original input.

If you are interested in how the free energy framework leads to neural nets with simple update rules for each neuron, see the tutorial by Rafal Bogacz in the sources below.

In general this type of net uses ‘‘predictive coding’’, because some of the nodes in the network encode the differences between inputs and predictions of the network. In other words, there are neurons just to measure ‘mismatch’ between prediction and reality at every level in the hierarchy.
If we call the other neurons “state neurons” then these nets are built as follows:

The mismatch neurons pass prediction errors forward to state-units in the higher level and laterally to state-units at the same level. The reciprocal influences of the state on the error-units are mediated by backward connections and lateral interactions. In summary, all connections between error and state-units are reciprocal, where the only connections that link levels are forward connections conveying prediction error to state-units and reciprocal backward connections that mediate predictions.
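
The interplay of state units and error units can be sketched for the simplest possible case: one state unit, one error unit comparing it with a prior expectation, and one error unit comparing the sensory input with the prediction made from the state. This is in the spirit of the one-variable example in the Bogacz tutorial listed in the sources, but the code itself, the function name `infer`, the prediction function `g(phi) = phi ** 2`, and all numbers are my own illustrative assumptions.

```python
def infer(u, v_prior, sigma_u=1.0, sigma_p=1.0, dt=0.01, steps=2000):
    """Settle one state unit using two prediction-error units.

    u        sensory input
    v_prior  prior expectation for the hidden state
    sigma_*  assumed variances (they scale each error)
    """
    g = lambda phi: phi ** 2      # assumed mapping from state to predicted input
    dg = lambda phi: 2 * phi      # its derivative
    phi = v_prior                 # start the state at the prior expectation
    for _ in range(steps):
        eps_p = (phi - v_prior) / sigma_p   # error unit: state vs. prior
        eps_u = (u - g(phi)) / sigma_u      # error unit: input vs. prediction
        # The state moves to reduce both errors (simple Euler step).
        phi += dt * (eps_u * dg(phi) - eps_p)
    return phi, eps_p, eps_u

phi, eps_p, eps_u = infer(u=2.0, v_prior=3.0)
print(phi, eps_p, eps_u)
```

The state settles at a compromise between what the prior expects and what the input suggests, at which point the two weighted errors balance each other.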

This architecture, which segregates some neurons just for detecting mismatch, has been disputed by the group at the University of Colorado led by Randy O’Reilly – and I’ll have a post on his work in a few days. He actually thinks the error of all levels is projected on a common area in the brain. On the other hand, lots of people are now finding empirical evidence for mismatch architectures distributed throughout the cortical hierarchy (e.g.,

In any event, free-energy ideas work, and working neural nets that control robots have been built on these principles by a company called Brain Corp.

Karl takes the view that we can look at sensory information, and the actions that result, as one system that tries to minimize free energy. The more we understand the world, the more we can predict it, control it, and reduce unpleasant surprises.
It has been objected that the best way to minimize surprise would be to go into a dark room and stay there. Obviously, animals and people do not do that. I asked Professor Friston about that, and he said:

…the imperative to minimize expected free energy is the same as minimizing expected surprise or uncertainty. Heuristically, the best way to minimize uncertainty about the cause of sensations is to switch on the light in the dark room. In other words, the dark room problem is a false problem. (In fact, we tend to avoid dark rooms whenever possible – unless we want to go to sleep! )

There is a paradox that experts, when solving a problem, may show less activity (in an MRI looking at the brain) than novices do. I mentioned that as well, and Professor Friston replied that this is true, and that the way of looking at this was that

the free energy can always be decomposed into accuracy and complexity (by rearranging the energy and entropy terms). People with inefficient models have more complex models and a greater computational complexity cost. Metabolically, this usually manifests as a greater brain activation – until they then acquire a more efficient model or representation of the task at hand.

A skeptic might object that life is full of surprises. Most Americans were very surprised when hijackers flew airliners into skyscrapers in New York, partly because the motives for the attack made no sense. But minimizing surprise on all time scales and experience meant having to accept that these things happened. If we were to reject the facts about that terrorist act, we would have to create even greater surprise – that our media and government conspired to fool us about it. Some people, oddly enough, believe that. Perhaps there is a ‘free energy’ gone wrong with ‘conspiracy theories’?

The idea of minimizing surprise also comes up in various practical applications such as the uses of Information Theory. It is now more than six decades since Claude Shannon set out this framework, with its core idea of equating generation of information with reduction of uncertainty (i.e., “surprise”). Showing that surprise could be quantified in terms of the range of choice applying, Shannon gave the framework mathematical teeth, enabling the extraordinary diversity of applications that continue to this day.
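
Shannon's measure is easy to state in code. In this small sketch (the function names are mine), the surprise of an outcome with probability p is -log2(p) bits, and entropy is the average surprise over all outcomes. With eight equally likely choices, each outcome carries 3 bits of surprise, which illustrates the "range of choice" idea.

```python
import math

def surprise_bits(p):
    """Surprise (self-information) of an outcome with probability p, in bits."""
    return -math.log2(p)

def entropy_bits(probs):
    """Average surprise over a probability distribution, in bits."""
    return sum(p * surprise_bits(p) for p in probs if p > 0)

coin = entropy_bits([0.5, 0.5])           # a fair coin
eight_way = entropy_bits([1 / 8] * 8)     # eight equally likely choices
rare = surprise_bits(0.001)               # a very unlikely event
print(coin, eight_way, rare)
```

Rare events carry far more surprise than common ones, which is what a surprise-minimizing system would be trying to avoid on average.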

Friston points out also that

Under the free-energy principle, the agent will become an optimal (if approximate) model of its environment. This is because, mathematically, surprise is also the negative log-evidence for the model entailed by the agent. This means minimizing surprise maximizes the evidence for the agent (model). Put simply, the agent becomes a model of the environment in which it is immersed. This is exactly consistent with the Good Regulator theorem of Conant and Ashby (1970). This theorem, which is central to cybernetics, states that “every Good Regulator of a system must be a model of that system.”

People do not share their entire models of the world. We share low level models – for instance, we assume that if we throw a spear, it will come down somewhere, but higher level models – such as how economies work, or our own history – are often quite different. I think this is because there is less room for error on the lower levels. In a practical sense, if you do not realize that gravity works on spears, you are not going to survive very long as a hunter gatherer, but if you believe in Keynesian economics vs your neighbor who believes in Hayek’s economics, you can survive without problem. Much of our reasoning occurs in a world where we can’t have all the answers, and we have to infer plausible bridges into the unknown.

Karl Friston’s website:
A Tutorial on the Free Energy Framework for modeling perception and learning – Rafal Bogacz – Journal of Mathematical Psychology (2015)
Tutorial on Hopfield Networks: (from the University of Minnesota Psychology Department) – note, this uses node activities of 0,1 instead of -1,1 – the Wikipedia article on the topic uses -1,1 – both are valid.

Learning weights between a pair of neurons: a long way past Hebb

While many artificial neural nets update their weights with a rule that does not involve time, spike-timing-dependent plasticity (STDP) has been found to be common in the brain. Here, it is not enough to say that neuron A fires at the same time as neuron B and therefore increases the weight between them. For synapses between cortical or hippocampal pyramidal neurons, a presynaptic spike a few milliseconds before a postsynaptic one typically leads to long-term potentiation (LTP), which strengthens the weight, whereas the reverse timing leads to depression, which weakens the weight. This way a weight only gets strengthened from an event earlier in time to an event later in time and not vice versa. But this is not the whole story either.
One anomaly occurs when triplets of neural spikes are applied, instead of just pairs.
Consider these two triplets:

post then pre then post
pre then post then pre

(where ‘pre’ is the firing of a presynaptic neuron, and ‘post’ is the firing of a postsynaptic neuron)

Both triplets contain two transitions: (post, pre) and (pre, post)

The only difference is the order of the transitions.

They both should have the same effect, but pre-post-pre with the same timing difference leads to insignificant changes, whereas post-pre-post induces a strong potentiation of synapses in hippocampal cultures.
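
To see why a purely pair-based rule cannot tell the two triplets apart, here is a minimal sketch of a classic exponential STDP window applied to both. The amplitudes, the 20 ms time constant, and the 10 ms spacing are hypothetical values chosen only for illustration, and summing over all pre/post pairs is one common convention, not the only one.

```python
import math

A_PLUS, A_MINUS = 1.0, 0.5   # hypothetical potentiation / depression amplitudes
TAU_MS = 20.0                # hypothetical time constant of the STDP window

def pair_dw(t_pre, t_post):
    """Weight change contributed by one pre/post spike pair."""
    dt = t_post - t_pre
    if dt > 0:                                   # pre before post: potentiation
        return A_PLUS * math.exp(-dt / TAU_MS)
    if dt < 0:                                   # post before pre: depression
        return -A_MINUS * math.exp(dt / TAU_MS)
    return 0.0

def total_dw(pre_times, post_times):
    """Pair-based rule: sum the contribution of every pre/post pair."""
    return sum(pair_dw(tp, tq) for tp in pre_times for tq in post_times)

gap = 10.0  # ms between consecutive spikes in the triplet
post_pre_post = total_dw(pre_times=[gap], post_times=[0.0, 2 * gap])
pre_post_pre = total_dw(pre_times=[0.0, 2 * gap], post_times=[gap])
print(post_pre_post, pre_post_pre)
```

Both triplets contain one potentiating pair and one depressing pair with the same time differences, so the pair-based rule assigns them the same weight change, which is not what the experiments show.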

So Claudia Clopath and Wulfram Gerstner came up with a simple but better theory.

First some terminology. Let U be the instantaneous voltage in the postsynaptic neuron. Let U~ be the voltage over a longer time period in that neuron (a low pass time filter of depolarization will give you that.) You can think of U~ as a moving average of voltage over a longer time period.

They found that if the post synaptic neuron had been depolarized for some time, and subsequently a presynaptic spike occurs, the result is ‘depression’ of the connection. When they say depolarized, they mean depolarized above a threshold T-. T- is a lower threshold compared with T+, which comes into use in ‘potentiation’.
This sequence might happen if a postsynaptic spike happened recently, so the depolarization of the postsynaptic neuron had not completely descended to normal values, and then a presynaptic spike comes in.


Potentiation of the synapse occurs if the following three conditions are met simultaneously:

(i) The momentary postsynaptic voltage U is above a threshold T+ which is around the firing threshold of the neuron. (ii) The low-pass filtered voltage U~ is above T-. (iii) A presynaptic spike occurred a few milliseconds earlier and has left a “trace” x at the site of the synapse. The trace could represent the amount of glutamate bound at the postsynaptic receptor; or the percentage of NMDA receptors in an upregulated state or something similar.

Note that the postsynaptic neuron enters the picture twice. First, we need a spike to overcome the threshold T+ and second, the filtered membrane must be depolarized before the spike. This depolarization could be due to earlier action potentials which have left a depolarizing spike after-potential which explains the relevance of post-pre-post or pre-post-pre triplets of spikes or to sustained input at other synapses.

The model takes a weighted combination of both the influence on depression (by the spikes and longer term voltage), and the influence of the same measures on potentiation, and that combination predicts what will happen to the weights in the neuron. In other words, the model doesn’t assume that only one process happens at a time at a synapse, it assumes both happen to some extent, at the same time.
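
The conditions above can be turned into a rough sketch of a voltage-based update. This is a simplification based only on the description in this post: it uses a single filtered voltage U~ for both terms, and the thresholds and amplitudes are placeholder numbers of my own, not the fitted values from the paper.

```python
def rectify(x):
    """Pass positive values through, clip negative values to zero."""
    return max(x, 0.0)

def weight_change(pre_spike, trace_x, u, u_filtered,
                  theta_minus=-70.0, theta_plus=-45.0,
                  a_ltd=1e-4, a_ltp=1e-4):
    """Instantaneous weight change at one synapse (sketch).

    pre_spike    True if a presynaptic spike arrives right now
    trace_x      trace left by earlier presynaptic spikes (0 = none)
    u            momentary postsynaptic voltage U, in mV
    u_filtered   low-pass filtered voltage U~, in mV
    theta_minus  threshold T- (placeholder value)
    theta_plus   threshold T+ (placeholder value)
    """
    # Depression: a presynaptic spike arrives while U~ is above T-.
    depression = a_ltd * (1.0 if pre_spike else 0.0) * rectify(u_filtered - theta_minus)
    # Potentiation: U above T+, U~ above T-, and a presynaptic trace present.
    potentiation = a_ltp * trace_x * rectify(u - theta_plus) * rectify(u_filtered - theta_minus)
    # Both processes can contribute at the same moment.
    return potentiation - depression

at_rest = weight_change(False, 0.0, u=-72.0, u_filtered=-72.0)
ltd_case = weight_change(True, 0.0, u=-65.0, u_filtered=-60.0)
ltp_case = weight_change(False, 1.0, u=-30.0, u_filtered=-60.0)
print(at_rest, ltd_case, ltp_case)
```

Because the two terms are simply added, a single moment can contain both depression and potentiation, matching the point that both processes happen to some extent at the same time.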

For plasticity experiments considered here, it is crucial to have a spike after depolarization in order to have a trace of the spike lasting for about 50 milliseconds

The model has some complications, for instance:

The plasticity model depends directly on the postsynaptic voltage at the synapse; depending on the location of the synapse along the dendrite, the time course of the voltage is expected to be different.

This means that when a postsynaptic spike occurs, the depolarization spreads all over the neuron, both forward and backward into the dendrites. Because some dendrites are farther than others from the point where the spike starts, their potentials will differ, and so the model might predict different weight changes in one area of the neuron than in another.

In another paper, Claudia Clopath and Jacopo Bono describe a different type of spike altogether. Not only do neurons have somatic spikes that travel down the axon and release neurotransmitter, but they have NMDA spikes – which occur most often in the distal parts of dendrites (the far ends of the dendrites away from the soma). (Somatic spikes are more easily triggered in the proximal parts of dendrites). A somatic spike is the firing of the neuron, an NMDA spike is a brief spike (usually) at a dendrite that cannot fire the neuron by itself. dLTP is the abbreviation used for potentiation in the dendrites. One important difference between learning caused by the two types of spikes is that the target neuron has to actually fire (an action potential) for learning to happen with somatic spikes, but the target neuron does not have to fire for learning to occur with NMDA spikes.

Clopath and Bono speculate on the implications:

For this purpose, we consider a memory which associates several features. Such an association should be robust when presenting only its components, since neurons taking part in multiple assemblies and various sources of noise can lead to an incomplete activation of the association. For example, imagine that we have learned the association “coffee”, with components such as the colour brown, its taste, its smell, etc. We do not forget to associate the colour brown with the item “coffee”, even though we experience that colour much more often than we experience coffee…

We studied how ongoing activity affects memory retention in networks. In the first network, we implemented four groups of neurons, which we take to represent four features that constitute one association. For example, with “coffee” one could associate the features “drink”; “colour brown”; “hot”; and “something you like” The network neurons are all-to-all connected and the connections are randomly distributed across distal and proximal compartments. Importantly, the distal connections coming from neurons of the same feature are always clustered on the same distal compartment post-synaptically. We simulated ongoing activity by randomly choosing a feature and activating it. Carrying on the previous example, we can imagine that we encounter a brown colour, a drink, something hot and the something you like at occasions other than when thinking about coffee. Since the neurons from these different features are never activated together, proximal weights between different features are weakened.

In other words, the various attributes of coffee don’t always go together. Sometimes the color brown might fire when viewing soil, or chocolate. So a target neuron for ‘coffee’ might sometimes get an input from ‘brown’ when the other attributes of ‘coffee’ are not present. The neuron won’t have enough inputs to fire, and this means the link from ‘brown’ should weaken.

The authors continue:

However, the active features always stimulate distally projecting clustered synapses, and NMDA spikes will be evoked more easily. As a result, we find that the distal weights between neurons of different features do not depress substantially compared to the proximal weights….[we] explored how the learning and re-learning of such associations affect each other. We divided a network into two associative memories, for example one “chocolate” and the other “coffee”. Each consists of 4 groups of neurons, representing different features of the association. Both chocolate and coffee share colour and “something you like” features while having two unshared features each…
Our simulations suggest that dLTP allows a subset of strengthened weights to be maintained for a longer time compared to STDP. Due to this mechanism, a trace of a previously learned memory can remain present even when the memory has not been activated for a long time. dLTP protects the weights from being weakened by ongoing activity, while synapses unable to evoke dLTP are depressed.

The authors also speculate that the NMDA spikes, which have a depolarizing effect on the soma, make it easier for normally weak inputs to trigger a somatic spike. So the NMDA spikes act as a ‘teacher’ that gates an input.

At a minimum, this research shows that learning, even in a single neuron, is more complicated than had been thought.

To make things even more complicated, a group in Israel has proposed that a ‘dendrite’ weight exists separately from the synaptic weights: two or three synapses could converge onto a dendrite segment, and the segment would have its own weight that also learns.
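
As a rough picture of what a separate dendrite weight means structurally, here is a minimal Python sketch. It only shows a forward pass, not any learning rule, and it is not taken from the paper: the class `DendriteSegment`, the function `neuron_fires`, the multiplicative way the segment weight is applied, and the threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DendriteSegment:
    """A dendrite segment with several synapses plus its own weight."""
    synapse_weights: List[float]   # one weight per synapse on this segment
    segment_weight: float = 1.0    # the extra, per-dendrite weight

    def drive(self, inputs: Sequence[float]) -> float:
        if len(inputs) != len(self.synapse_weights):
            raise ValueError("need exactly one input per synapse")
        # Synaptic inputs are summed first, then scaled by the segment weight.
        summed = sum(w * x for w, x in zip(self.synapse_weights, inputs))
        return self.segment_weight * summed


def neuron_fires(segments: Sequence[DendriteSegment],
                 inputs_per_segment: Sequence[Sequence[float]],
                 threshold: float = 1.0) -> bool:
    """The neuron fires if the total drive from all segments reaches threshold."""
    total = sum(seg.drive(x) for seg, x in zip(segments, inputs_per_segment))
    return total >= threshold


if __name__ == "__main__":
    inputs = [[1.0, 1.0], [1.0, 1.0, 1.0]]
    segments = [
        DendriteSegment([0.5, 0.5], segment_weight=1.0),
        DendriteSegment([0.2, 0.2, 0.2], segment_weight=0.5),
    ]
    print(neuron_fires(segments, inputs))
    # Lower only the first segment's own weight; the synapses are untouched.
    segments[0].segment_weight = 0.5
    print(neuron_fires(segments, inputs))
```

The point of the sketch is that changing a segment weight alone can change whether the neuron fires, even though no synaptic weight has moved, so there are two kinds of adjustable quantity rather than one.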


In their paper, the dendrite learns using the same rule as the synapses, an STDP rule that uses some of the same information (such as the depolarization of the target neuron). The resulting weights do not end up at extremes of high or low values, but can stabilize at intermediate values (though those values can oscillate). Nobody has yet found oscillations in dendrites, but they would be difficult to find.

It is likely that the neural nets of the future that take current biology as inspiration will have units that are more complex than today’s. In fact, the authors of the dendrite learning paper say that so many different firing patterns are created by their architecture that “notions like capacity of a network, capacity per weight and generalization have to be redefined” and that should include “the possible number of oscillatory attractors for the weights… [with] their implication on advanced deep learning algorithms”.


Voltage and spike timing interact in STDP – a unified model by Claudia Clopath and Wulfram Gerstner
Modeling somatic and dendritic spike mediated plasticity at the single neuron and network level by Jacopo Bono and Claudia Clopath
Adaptive nodes enrich nonlinear cooperative learning beyond traditional adaptation by links by Shira Sardi, Roni Vardi, Amir Goldental, Anton Sheinin, Herut Uzan and Ido Kanter