Creating Neural Nets based on a Free-Energy principle

In 1982 John Hopfield invented a neural net that settles into a state satisfying as many of a set of constraints as it can. If two nodes of the net were meant to go together, they would start off with a positive weight between them. If two nodes were incompatible, they would start with a negative weight. Sometimes this would lead to conflict: for instance, if nodes A and B were supposed to go together, and B and C were also supposed to go together, but A and C were supposed to be incompatible. All nodes are connected to each other, and each sends a signal to all of its neighbors.

An example of such a conflict is if node A signals the desire to have cake, node C signals the desire to be thin, and node B signals the desire to attend a big party. You can create a large net with conflicting constraints (fixed weights) and let the activations evolve, using Hopfield’s update rule, until the pattern of activations settles in a minimum where conflict is minimized and harmony is maximized. A conflict in such a net would be where two nodes are connected by reciprocal positive weights, but one node has an activation that is negative and the other has an activation that is positive. Each node is then telling the other that it is ‘wrong’.
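The settling process described here can be sketched in a few lines of Python. This is a minimal illustration, not Hopfield's original code: the three-node net, the weight values, and the helper names `settle` and `count_conflicts` are all assumptions made for this example. It assumes cake (A) and party (B) go together, party (B) and thin (C) go together, and cake (A) and thin (C) clash. A node with zero net input is left unchanged.

```python
# Hypothetical three-node net: A = cake, B = party, C = thin.
# Symmetric fixed weights: +1 means "go together", -1 means "incompatible".
W = [
    [0, 1, -1],   # A: with B, against C
    [1, 0, 1],    # B: with A, with C
    [-1, 1, 0],   # C: against A, with B
]

def settle(state, weights, max_sweeps=10):
    """Update +1/-1 activations one node at a time until nothing changes."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            # Net input: what all the other nodes are "telling" node i.
            net = sum(weights[i][j] * state[j] for j in range(n) if j != i)
            if net > 0:
                new = 1
            elif net < 0:
                new = -1
            else:
                new = state[i]  # tie: keep the current activation
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state

def count_conflicts(state, weights):
    """Count pairs whose activations disagree with the sign of their weight."""
    n = len(state)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if weights[i][j] * state[i] * state[j] < 0
    )

start = [1, -1, 1]                      # every pair is in conflict
final = settle(start, W)
print(count_conflicts(start, W))        # 3
print(final)                            # [-1, -1, 1]
print(count_conflicts(final, W))        # 1
```

With these weights the three constraints cannot all be satisfied at once, so the net settles with one conflict remaining rather than zero.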

Hopfield’s algorithm minimizes a measure he calls ‘Energy’. The equation for this is:

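$$E = -\sum_{i<j} w_{ij}\, s_i\, s_j$$
Here $s_i$ is the activation of node $i$, $w_{ij}$ is the weight between nodes $i$ and $j$, and the sum runs once over each pair of nodes.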
The goal is to minimize this energy. There is a measure called “Harmony”, which is the above equation without the negative sign, so minimizing energy is maximizing harmony. For convenience I’ll use harmony to make this point: when nodes X and Y are both ‘1’ (they can be either 1 or -1) and the weight between them is 1, then the harmony is 1 * 1 * 1, which is the highest it could be for two nodes. So if two values have the same sign, and the weight between them is positive, then they reinforce each other. If X was -1 and Y was 1, and the weight between them was -1, then that also results in a large harmony (-1 * 1 * -1 ==> 1), and that is desired, since the negative weight says the two nodes are incompatible and their opposite signs respect that.
In reality there are more than two nodes in most Hopfield nets, and every node is connected to every other node.
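The arithmetic in the two-node examples can be checked directly. The sketch below assumes harmony is the sum, over each pair of nodes, of the weight times both activations, and that energy is its negative. The function names `harmony` and `energy` and the two small weight matrices are hypothetical, chosen only for this illustration.

```python
def harmony(state, weights):
    """Sum of weight * activation * activation over each pair of nodes."""
    n = len(state)
    return sum(
        weights[i][j] * state[i] * state[j]
        for i in range(n)
        for j in range(i + 1, n)
    )

def energy(state, weights):
    """Energy is harmony with the sign flipped."""
    return -harmony(state, weights)

agree = [[0, 1], [1, 0]]      # the two nodes are meant to go together
oppose = [[0, -1], [-1, 0]]   # the two nodes are incompatible

print(harmony([1, 1], agree))     # 1: same sign, positive weight
print(harmony([-1, 1], oppose))   # 1: opposite signs, negative weight
print(harmony([1, -1], agree))    # -1: opposite signs, positive weight
print(energy([1, -1], agree))     # 1: the conflict shows up as higher energy
```

The same functions work unchanged for larger, fully connected nets, since the sum simply runs over more pairs.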

It turns out that the “Harmony” function is closely related to a quantity used in modern-day machine learning known variously as an evidence lower bound (ELBO) or variational free energy. In other words, finding the right sort of solution, one that satisfies multiple constraints, can be thought of as minimizing free energy. So what is free energy? In thermodynamics, free energy is the energy available to do work; the part that is unavailable is tied to entropy. This means maximizing Harmony, or minimizing free energy, reduces the energy or tension in a system, while maintaining a high entropy. So what is entropy?

In the physical world, entropy is a measure of disorder. For instance, if you have a group of balls confined in a triangular rack on a pool table, and you remove the rack and hit the balls with the cue stick, then they can take on many configurations as they scatter across the table. It would be difficult to hit the balls again and make them organize into the original triangle. In theory it is possible, but the probability would be low. It’s much easier to go the other way.

So you might wonder whether ‘energy’ and ‘entropy’ as defined in the Hopfield net signify anything more than an analogy, though they do use similar math to thermodynamic free energy. There is even more to the analogy: Hopfield nets traverse energy landscapes as they resolve the conflicts between nodes, much as the physical force of ‘gravity’ is manifest in a landscape of basins in space-time created by massive bodies such as planets.

I asked Professor Karl Friston (a professor at the Institute of Neurology, University College London), who has created nets based on a theory of “free energy”, what energy and entropy mean. I said that my intuitive idea of energy was that the amount of conflict among constraints is energy, and when all constraints are satisfied, you have an energy minimum. He replied:

…free energy comes in many guises. It is also called an evidence lower bound (ELBO), bound on integrated or marginal likelihood and so on. The reason it is called free energy is that it can always be decomposed into an energy and entropy term. These can be rearranged into a term that corresponds to the evidence (the probability of data under a particular model) and a KL (Kullback–Leibler) divergence that is a non-negative bound. In a sense, your intuition about energy is correct. In the variational free energy, the energy is the negative log probability of a particular configuration of causes and consequences. Finding the most likely configuration is equivalent to minimizing the energy – much as in statistical thermodynamics.

The entropy means what it says. It is a Shannon entropy of approximate posterior beliefs. This is closely related to Occam’s principle.

The remark comparing entropy to Occam’s principle is interesting. Occam’s principle is the problem-solving principle that, when presented with competing hypothetical answers to a problem, one should select the one that makes the fewest assumptions. In other words, if you can explain something without committing to a highly implausible configuration of causes (e.g., neatly arranged billiard balls and a nice triangle), then it is – provably – a better explanation (in the sense of universal computation and Bayesian or abductive inference).
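Professor Friston's description of free energy as an energy term minus an entropy term, which can be rearranged into evidence plus a KL divergence, can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: a single observation, a hidden cause with two possible values, and made-up joint probabilities. The names `free_energy`, `kl`, `joint`, and `q` are hypothetical.

```python
import math

def free_energy(q, joint):
    """Expected energy under q minus the Shannon entropy of q.

    Energy of a configuration is the negative log of its joint probability.
    """
    expected_energy = sum(qz * -math.log(pz) for qz, pz in zip(q, joint) if qz > 0)
    entropy = -sum(qz * math.log(qz) for qz in q if qz > 0)
    return expected_energy - entropy

def kl(q, p):
    """Kullback-Leibler divergence KL(q || p); never negative."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

# Assumed numbers: p(data, cause=0) and p(data, cause=1) for one observed datum.
joint = [0.3, 0.1]
evidence = sum(joint)                        # p(data)
posterior = [pz / evidence for pz in joint]  # exact p(cause | data)
surprise = -math.log(evidence)               # negative log evidence

q = [0.5, 0.5]                               # approximate posterior beliefs
F = free_energy(q, joint)

# Free energy equals surprise plus a non-negative KL term, so it bounds surprise.
print(abs(F - (surprise + kl(q, posterior))) < 1e-12)        # True
print(F >= surprise)                                         # True
# The bound is tight when the beliefs equal the exact posterior.
print(abs(free_energy(posterior, joint) - surprise) < 1e-12) # True
```

Because the KL term cannot go below zero, pushing free energy down either improves the beliefs `q` or, once they are exact, leaves only the surprise itself.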

An article on the Hopfield net defined Hopfield’s energy as: “a measure of constraint satisfaction–when all the constraints are satisfied, the energy is lowest. Energy represents the cumulative tension between competing and cooperating constraints.”

The nets that Professor Friston develops are organized as hierarchies. Sensory information enters at the lowest part of the tree (or hierarchy) and learned predictive information comes down the tree. The two streams meet, and if they don’t match, then the mismatch or error information propagates upward, to change the weights of the net so that it produces better predictions in future. The algorithm tries to minimize free energy, which also means that it tries to minimize surprise, or more simply the mismatch between the data at hand and predictions of those data (i.e. prediction error). The mismatch is calculated at every level.

Karl’s nets are recurrent: signals don’t just go up the hierarchy, they also go down the hierarchy. So a low-level node affects upper-level nodes, whose changes then affect the lower-level nodes. In this way mismatches and bad predictions get minimized. The same idea applies to backpropagation in neural networks, where a mismatch between desired result and actual result is minimized. Interestingly, exactly the same sort of architecture is found in machine learning in the form of bottleneck architectures (e.g., variational autoencoders), where (ascending) signals are forced through a ‘complexity minimizing’ bottleneck to fan out again to provide (descending) signals that predict the original input.

If you are interested in how the free energy framework leads to neural nets with simple update rules for each neuron, see the tutorial by Rafal Bogacz in the sources below.

In general this type of net uses “predictive coding”, because some of the nodes in the network encode the differences between inputs and predictions of the network. In other words, there are neurons just to measure ‘mismatch’ between prediction and reality at every level in the hierarchy.
If we call the other neurons “state neurons” then these nets are built as follows:

The mismatch neurons pass prediction errors forward to state-units in the higher level and laterally to state-units at the same level. The reciprocal influences of the state on the error-units are mediated by backward connections and lateral interactions. In summary, all connections between error and state-units are reciprocal, where the only connections that link levels are forward connections conveying prediction error to state-units and reciprocal backward connections that mediate predictions.
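To make the interplay of state units and error units concrete, here is a minimal single-level sketch, loosely in the spirit of the simplest example in the Bogacz tutorial. It is not Friston's implementation. It assumes one state unit, one sensory input, Gaussian noise, and a prediction that is simply the state itself (a simplification; the tutorial uses a nonlinear prediction). It shows inference only, not weight learning. The names `infer`, `phi`, `err_prior`, and `err_obs` are hypothetical.

```python
def infer(u, v_prior, var_prior=1.0, var_obs=1.0, rate=0.1, steps=200):
    """Settle a single state unit by following its two prediction errors.

    u        : sensory input
    v_prior  : what the level above expects the state to be
    var_*    : assumed noise variances (errors are scaled by them)
    """
    phi = v_prior  # start from the top-down expectation
    for _ in range(steps):
        # Error unit comparing the state with the top-down expectation.
        err_prior = (phi - v_prior) / var_prior
        # Error unit comparing the input with the state's prediction of it.
        err_obs = (u - phi) / var_obs
        # The state unit moves to reduce both errors (gradient descent on
        # the sum of squared, variance-weighted errors).
        phi += rate * (err_obs - err_prior)
    return phi

def precision_weighted_mean(u, v_prior, var_prior=1.0, var_obs=1.0):
    """Exact answer for this linear Gaussian case, for comparison."""
    return (v_prior / var_prior + u / var_obs) / (1 / var_prior + 1 / var_obs)

estimate = infer(u=2.0, v_prior=3.0)
exact = precision_weighted_mean(u=2.0, v_prior=3.0)
print(abs(estimate - exact) < 1e-6)  # True: the unit settles between 2 and 3
```

Each update uses only quantities available to the state unit and its two error units, which is the sense in which these nets have simple local update rules.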

This architecture, which segregates some neurons just for detecting mismatch, has been disputed by the group at the University of Colorado led by Randy O’Reilly, and I’ll have a post on his work in a few days. He actually thinks the error of all levels is projected onto a common area in the brain. On the other hand, lots of people are now finding empirical evidence for mismatch architectures distributed throughout the cortical hierarchy (e.g., https://www.ncbi.nlm.nih.gov/pubmed/28957679).

In any event, free-energy ideas work, and working neural nets that control robots have been built on these principles by a company called Brain Corp.

Karl takes the view that we can look at sensory information, and the actions that result, as one system that tries to minimize free energy. The more we understand the world, the more we can predict it, control it, and reduce unpleasant surprises.
It has been objected that the best way to minimize surprise would be to go into a dark room and stay there. Obviously, animals and people do not do that. I asked Professor Friston about that, and he said:

…the imperative to minimize expected free energy is the same as minimizing expected surprise or uncertainty. Heuristically, the best way to minimize uncertainty about the cause of sensations is to switch on the light in the dark room. In other words, the dark room problem is a false problem. (In fact, we tend to avoid dark rooms whenever possible – unless we want to go to sleep! )

There is a paradox that experts, when solving a problem, may show less brain activity (in an MRI scan) than novices do. I mentioned that as well, and Professor Friston replied that this is true, and that the way of looking at this was that

the free energy can always be decomposed into accuracy and complexity (by rearranging the energy and entropy terms). People with inefficient models have more complex models and a greater computational complexity cost. Metabolically, this usually manifests as a greater brain activation – until they then acquire a more efficient model or representation of the task at hand.
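The rearrangement into accuracy and complexity can also be checked on a toy example. In the sketch below, complexity is taken to be the KL divergence between the beliefs and the prior, and accuracy the expected log likelihood of the data under those beliefs. These are the usual definitions, but the numbers and the names `prior`, `likelihood`, and `q` are assumptions made only for illustration.

```python
import math

prior = [0.5, 0.5]        # assumed p(cause)
likelihood = [0.6, 0.2]   # assumed p(data | cause) for the one observed datum
q = [0.7, 0.3]            # approximate posterior beliefs about the cause

# Complexity: how far the beliefs have moved away from the prior.
complexity = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior))
# Accuracy: how well the beliefs explain the data.
accuracy = sum(qz * math.log(lz) for qz, lz in zip(q, likelihood))

free_energy = complexity - accuracy

# The same quantity written as expected energy minus entropy.
expected_energy = sum(
    qz * -math.log(pz * lz) for qz, pz, lz in zip(q, prior, likelihood)
)
entropy = -sum(qz * math.log(qz) for qz in q)

print(abs(free_energy - (expected_energy - entropy)) < 1e-12)  # True
print(complexity >= 0)                                         # True
```

Read this way, two models that explain the data equally well differ in free energy only through their complexity, which is the cost Professor Friston associates with the novice's less efficient model.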

A skeptic might object that life is full of surprises. Most Americans were very surprised when hijackers flew airliners into skyscrapers in New York, partly because the motives for the attack made no sense. But minimizing surprise on all time scales and experience meant having to accept that these things happened. If we were to reject the facts about that terrorist act, we would have to create even greater surprise – that our media and government conspired to fool us about it. Some people, oddly enough, believe that. Perhaps there is a ‘free energy’ gone wrong with ‘conspiracy theories’?

The idea of minimizing surprise also comes up in various practical applications such as the uses of Information Theory. It is now more than six decades since Claude Shannon set out this framework, with its core idea of equating generation of information with reduction of uncertainty (i.e., “surprise”). By showing that surprise could be quantified in terms of the range of choices available, Shannon gave the framework mathematical teeth, enabling the extraordinary diversity of applications that continue to this day.
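Shannon's way of quantifying surprise is short enough to write out. In this sketch the surprise of an outcome is the negative base-2 logarithm of its probability, and entropy is the average surprise. The function names `surprisal_bits` and `entropy_bits` are hypothetical.

```python
import math

def surprisal_bits(p):
    """Surprise, in bits, of an outcome that has probability p."""
    return -math.log2(p)

def entropy_bits(probs):
    """Average surprise, in bits, over a whole distribution."""
    return sum(p * surprisal_bits(p) for p in probs if p > 0)

print(surprisal_bits(0.5))        # 1.0: a fair coin flip carries one bit
print(entropy_bits([0.125] * 8))  # 3.0: eight equally likely choices
print(entropy_bits([1.0]))        # 0.0: a certain outcome is no surprise
```

The wider the range of equally likely choices, the more bits it takes to say which one occurred, and the more uncertainty an observation removes.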

Friston points out also that

Under the free-energy principle, the agent will become an optimal (if approximate) model of its environment. This is because, mathematically, surprise is also the negative log-evidence for the model entailed by the agent. This means minimizing surprise maximizes the evidence for the agent (model). Put simply, the agent becomes a model of the environment in which it is immersed. This is exactly consistent with the Good Regulator theorem of Conant and Ashby (1970). This theorem, which is central to cybernetics, states that “every Good Regulator of a system must be a model of that system.”

People do not share their entire models of the world. We share low-level models (for instance, we assume that if we throw a spear, it will come down somewhere), but higher-level models, such as how economies work or our own history, are often quite different. I think this is because there is less room for error on the lower levels. In a practical sense, if you do not realize that gravity works on spears, you are not going to survive very long as a hunter-gatherer, but if you believe in Keynesian economics and your neighbor believes in Hayek’s economics, you can both survive without a problem. Much of our reasoning occurs in a world where we can’t have all the answers, and we have to infer plausible bridges into the unknown.

Sources:
Karl Friston’s website: http://www.fil.ion.ucl.ac.uk/~karl/
A Tutorial on the Free Energy Framework for modeling perception and learning – Rafal Bogacz – Journal of Mathematical Psychology (2015)
Tutorial on Hopfield Networks: (from the University of Minnesota Psychology Department) – note, this uses node activities of 0,1 instead of -1,1 – the Wikipedia article on the topic uses -1,1 – both are valid.
