A ‘family-tree’ of words from a recurrent net

Back in 1990 (28 years ago now), Jeffrey Elman wrote an interesting paper titled “Finding Structure in Time”. He took a standard backpropagation net, with one hidden layer, and fed that layer back into the inputs.

To review what a backpropagation net is: it’s just a way of learning associations between a group of inputs and an output. The inputs could be a set of pixel values for a picture, and the output could be a binary category like “is it a cat” – which is 1 if the picture is of a cat, and 0 otherwise. The associations are learned by adjusting weights on connections leading from the inputs to the output(s). Usually there is a hidden layer, which gives the net more power to solve problems. An example neural net that has 3 inputs, 2 outputs, and a hidden layer might look like this:

[Figure: a small feedforward net with 3 input nodes, a hidden layer, and 2 output nodes]

When you present a pattern to the input units, for instance a pattern of 1’s and 0’s, they send their activation over weighted connections to the hidden nodes. Each hidden node combines its incoming weighted signals, applies a function to that combined value, and outputs the result on its outgoing connections (which in this case lead to the output layer). The output node(s) do the same operation. If the weights are adjusted properly (by a training algorithm), the net eventually learns to predict the correct values. So in the cat example, when you present a picture of a cat to the input units (as a group of pixel values), the output unit for ‘cat’ might output a ‘1’, the output unit for ‘dog’ might output a ‘0’, and if you had an output unit for ‘tiger’ it might output some intermediate value.

Suppose you had a net with two inputs, one output, and 2 hidden nodes in the hidden layer. You could use a net like this to learn problems such as XOR (Exclusive OR), which outputs a ‘1’ if the sum of the inputs is odd, and a ‘0’ if the sum is even.

To recap, you have some patterns (in the case of XOR only 4 patterns) and each pattern helps the net learn a relationship.

(There are many papers explaining backpropagation on the internet (here is one) and my assumption when writing this blog is that you know what backpropagation is, and that you have a background in high-school math, including calculus.)

The improvement Elman made was to feed the activations of the (in this case) 2 hidden nodes back to the input layer. This means now (in this example) that the input layer has 2 extra nodes, for a total of 4, where the first 2 nodes are the standard inputs, and the last 2 are ‘context nodes’, each receiving input from one of the two hidden nodes. Doing this doesn’t make sense for a simple XOR problem, but it would make sense for a series – perhaps a series where every bit is the XOR of the 2 bits before it.

For an XOR series starting with 0 and 1 you would get

0,1,1,0,1….

This series could go on forever, but since the net has 2 inputs, it’s presented in consecutive groups of 2 as:

0,1 –> 1

1,1 –> 0

1,0 –> 1

Etc.
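To make this concrete, here is a minimal sketch (my own Python illustration, not Elman’s code) that builds such an XOR bit series and slices it into the two-bit inputs and one-bit targets shown above:

```python
# A minimal sketch: build an XOR bit series and slice it into
# (two-bit input, next-bit target) training pairs.
def xor_series(length, start=(0, 1)):
    bits = list(start)
    while len(bits) < length:
        bits.append(bits[-2] ^ bits[-1])  # each new bit is the XOR of the previous two
    return bits

def make_pairs(bits):
    # Sliding window: (bits[i], bits[i+1]) -> bits[i+2]
    return [((bits[i], bits[i + 1]), bits[i + 2]) for i in range(len(bits) - 2)]

series = xor_series(10)     # [0, 1, 1, 0, 1, 1, 0, ...]
pairs = make_pairs(series)  # [((0, 1), 1), ((1, 1), 0), ((1, 0), 1), ...]
print(series)
print(pairs[:3])
```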

There is a time step involved. Initially, the hidden nodes are assumed to have an arbitrary value – such as an activation of 0.5. So instead of just presenting – say – 0, 1 to the input layer, we present 0, 1, 0.5, 0.5 to the 4 nodes in the input layer. The last two inputs are supposed to come from the activations of the hidden nodes, but at this point there haven’t been any activations there yet.

At time t = 2, we look at what the activations of the hidden nodes were when the first pattern was presented (at time t = 1) and feed them back. Suppose that, after the first inputs were presented and the hidden nodes combined them, the first hidden node’s activation was 0.7 and the second’s was 0.6. Then, continuing the example above, we have

1,1,0.7,0.6 (as inputs).

It turned out that this type of recurrent net could predict series of numbers presented successively over time. There was an actual memory in the net. The extra nodes in the input layer, which Elman called ‘context nodes’, made information from a previous time available to the net, which meant that the hidden nodes became a function not only of the current inputs, but of themselves at a previous instant. And at the next instant, they became a function of themselves again, and since they were already a function of a previous instant, they inherited information from that instant as well. You could write it out like this:

HiddenNodes(t=2) = f(HiddenNodes(t=1), Inputs(t=2))

HiddenNodes(t=3) = f(HiddenNodes(t=2), Inputs(t=3))

where f is the transformation the net applies between the input layer (the ordinary inputs plus the context nodes) and the hidden layer.
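Here is a minimal sketch of one forward step of such a simple recurrent net, with made-up weights, where the context nodes simply hold a copy of the previous hidden activations (this is an illustration of the idea, not Elman’s implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy sizes: 2 ordinary inputs, 2 hidden nodes (so 2 context nodes), 1 output.
rng = np.random.default_rng(0)
W_in = rng.normal(size=(2, 2 + 2))   # hidden weights over [inputs, context]
W_out = rng.normal(size=(1, 2))      # output weights over hidden activations

def step(inputs, context):
    # Concatenate the real inputs with the context (previous hidden activations).
    full_input = np.concatenate([inputs, context])
    hidden = sigmoid(W_in @ full_input)   # HiddenNodes(t) = f(Inputs(t), HiddenNodes(t-1))
    output = sigmoid(W_out @ hidden)
    return output, hidden                 # the new hidden state becomes the next step's context

context = np.full(2, 0.5)                 # arbitrary starting context, as described above
for x, target in [((0, 1), 1), ((1, 1), 0), ((1, 0), 1)]:
    y, context = step(np.array(x), context)
    print(x, "->", y.round(3), "target", target)
```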

Exclusive OR is a boring series, but Elman was just getting started.

He next made sequences of letters. Each letter was represented by 6 bits, and each bit meant something. Here is a table from his paper:

[Table: the 6-bit feature vectors Elman used to represent letters, from his paper]

He found that with letter sequences, it actually helped that the representation of each letter was meaningful in this way. If he had used random bit-patterns to represent letters, the predictions would have been less accurate. You might not be surprised at this: usually a vowel follows a consonant, so providing information about each letter, such as whether it is a vowel or a consonant, is helpful.

He then tried meaningful word combinations. And here, he ended up with a “family tree” of word meaning emerging from this simple neural net.

Here is what he says:

As a first step, a somewhat modest experiment was undertaken. A sentence generator program was used to construct a set of short (two- and three-word) utterances. Thirteen classes of nouns and verbs were chosen; these are listed in Table 3.

The generator program used these categories and the 15 sentence templates given in Table 4 to create 10,000 random two- and three-word sentence frames. Each sentence frame was then filled in by randomly selecting one of the possible words appropriate to each category. Each word was replaced by a randomly assigned 31-bit vector in which each word was represented by a different bit.

He is saying that since his dictionary (which was very limited) contained only 29 words, he could represent each word with a vector in which all bits are off except one, which is flipped on. (He used 31 bits instead of 29, keeping 2 extra bits for other purposes.)
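A sketch of that ‘one bit per word’ (one-hot) encoding, using a made-up word list rather than Elman’s actual vocabulary, might look like this:

```python
import numpy as np

# Hypothetical fragment of a vocabulary; Elman's had 29 words in 31-bit vectors.
words = ["boy", "girl", "dog", "cat", "chase", "see", "sing"]

def one_hot(word, vocab, width=None):
    width = width or len(vocab)
    vec = np.zeros(width, dtype=int)
    vec[vocab.index(word)] = 1     # every bit off except the one for this word
    return vec

print(one_hot("dog", words, width=31))
```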

For this simulation the input layer and output layers contained 31 nodes each, and the hidden and context layers contained 150 nodes each.

The task given to the network was to learn to predict the order of successive words. That’s why the output layer had 31 nodes: this allowed it to represent the one word being predicted. In a sentence such as “boy chases dog”, the first input might be ‘boy’ (with all the context nodes in the input layer set to 0.5, which is halfway between ‘off’ and ‘on’). At the next time instant the input would be “chases”, together with whatever values the hidden nodes had taken on when ‘boy’ was presented; at the same time ‘dog’ would be presented to the output layer as the target, and any mismatch between the prediction and “dog” would be used to adjust the weights so the net better predicts “dog”.

A problem arises since at any point there are various possibilities. A boy doesn’t have to chase a dog, he could chase a cat. A boy doesn’t have to chase anything: the sentence could be “boy sings.” However, there are probabilities: a verb will usually follow a noun, and the actions a “boy” prefers differ somewhat from those a “girl” prefers.

In Elman’s words:

For any given sequence of words there are a limited number of possible successors. Under these circumstances, the network should learn the expected frequency of occurrence of each of the possible successor words; it should then activate the output nodes proportional to these expected frequencies.

So what did the net learn from the co-occurrence statistics of the words in the various sentences?

Elman did a statistical cluster analysis of the activation values of the hidden nodes (the nodes of the hidden layer) after the net had learned all the patterns. In doing this analysis he would present a word, or the first words of a simple sentence, and then record the activation values of the hidden nodes. The clusters of these hidden-node patterns formed a hierarchy. Words that were closest in meaning created activation patterns in the hidden layer that were close to each other.
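The flavor of that analysis can be reproduced with standard hierarchical clustering; the sketch below uses SciPy on made-up 150-dimensional hidden-activation vectors (the word list and vectors are stand-ins, not Elman’s data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical: one averaged 150-dimensional hidden-activation vector per word.
rng = np.random.default_rng(1)
words = ["boy", "girl", "dog", "cat", "chase", "smash", "eat"]
hidden_vectors = {w: rng.normal(size=150) for w in words}

data = np.array([hidden_vectors[w] for w in words])
tree = linkage(data, method="average")        # build the hierarchical cluster tree
result = dendrogram(tree, labels=words, no_plot=True)  # words close in meaning would cluster together
print(result["ivl"])                          # the leaf ordering of the hierarchy
```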

The clusters showed that the net learned that:

there are several major categories of words. One large category corresponds to verbs; another category corresponds to nouns. The verb category is broken down into groups that require a direct object, or are intransitive, or where a direct object is optional. The noun category is broken into two major groups: inanimates and animates. Animates are divided into human and nonhuman; the nonhuman are divided into large animals and small animals. Inanimates are broken into breakable, edibles, and nouns which appeared as subjects of agentless active verbs.

…The network is not able to predict the precise order of words, but it recognizes that (in this corpus) there is a class of inputs (namely, verbs) that typically follow other inputs (namely, nouns). This knowledge of class behavior is quite detailed; from the fact that there is a class of items which always precedes “chase,” “break,” “smash,” it infers that the large animals form a class.

[Figure: hierarchical cluster diagram of the hidden-unit activation patterns, from Elman’s paper]

Elman notes that the hierarchy is “soft” and implicit. While some categories may be qualitatively distinct (i.e., very far from each other in space), there may also be other categories that share properties and have less distinct boundaries.

He says that there are limits to how much memory the net can have since:

There is limited numeric precision in the machines on which these simulations are run; the activation function is repetitively applied to the memory and results in exponential decay.

Though this work was done a long time ago, the ideas are being used today, notably by Brain Corp, which has a net built as a hierarchy of mini-nets, each of which uses its hidden layer as a source of context for other mini-nets. I will talk about their research (they sell robots partly based on it) in my next post.
You could speculate where else this basic idea can be used. Can it find organizational structure in types of data other than text? Can it find patterns in history?

Sources:

Finding Structure in Time by Jeffrey Elman University of California, San Diego – COGNITIVE SCIENCE 14, 179-211 (1990)

A Brain that works on Images – Where the Images are Neurotransmitter Maps

Professor Douglas Greer has some original ideas on how the cortex works, and he has made a working program based on them, the source-code of which you can get for free. The most revolutionary idea is that instead of looking at the firing of a group of neurons as a representation of an object, we should look at the distribution of neurotransmitter at synapses as the representation.

First we should define some basic terms here, so that we can understand his theory.

  1. Manifolds:

A manifold is like a landscape with some measure on it. For example, if you took a map of Manhattan and listed the heights of all buildings at all locations, you would have a 2-dimensional manifold. At certain points the heights would be zero (such as Central Park); at other points they would be large (like at the location of the Empire State Building).

A contour map showing the heights of mountains and valleys would also be a manifold. Manifolds can be 1-dimensional, 2-dimensional, 3-dimensional, and up. A contour map of the United States is a continuous manifold – it isn’t divided up into discrete points, it’s just a continuous shape with slopes at any point varying from positive to zero to negative (with an occasional abrupt discontinuity like a cliff). Manifolds can be discrete as well, and given that computers are digital, even continuous manifolds must often be approximated by discrete points. If you are wading into a cold swimming pool on a hot day, your skin can be thought of as a manifold with a measure of temperature, where your legs are colder than your upper body.

  2. S.R. flipflop

One interesting aspect of Greer’s ideas is that he takes human inventions and finds counterparts to them in the brain. The S.R. flipflop is one example; wavelets are another. Let’s look at the S.R. flipflop:

[Figure: an S-R flip-flop built from two cross-coupled NAND gates]

A logic gate is a device that takes in inputs (usually 2) and outputs a logical operation on them. For instance, an OR gate outputs a ‘1’ (for TRUE) if either or both of its inputs are ‘1’. Intuitively, this is like saying “if I take a nap on the beach, or I go to a tanning salon, I will get a tan.” If either or both premises are true, then the conclusion will be true. Using an AND gate is like inferring: “If there is a sale on the 3 remaining iPads at ‘Best Buy’, and I am first in line when it opens, I can purchase all three.” In other words, both premises have to be true for the conclusion to be true.

A NOT gate has only one input. If you feed it TRUE, the output is FALSE, and vice versa. (Or if you feed it 0, the output is 1, and vice versa).

A NOR gate negates the output of an OR gate. It is like an OR gate in series with a NOT gate.

You can create truth tables for these gates. For an OR gate, the truth table has 3 columns, the first two being the inputs, and the third being the output. It would look like this:

0,0,0
0,1,1
1,0,1
1,1,1

The first row shows that if neither premise is true, then the conclusion is not true either.

The basic building block that makes computer memories possible, and is also used in many sequential logic circuits is the flip-flop or bi-stable circuit. Just two inter-connected logic gates make up the basic form of this circuit whose output has two stable output states. When the circuit is triggered into either one of these states by a suitable input pulse, it will ‘remember’ that state until it is changed by a further input pulse, or until power is removed.

The SR flip-flop can be considered as a 1-bit memory, since it stores the input pulse even after it has passed. Flip-flops (or bi-stables) of different types can be made from logic gates and, as with other combinations of logic gates, the NAND and NOR gates are the most versatile, the NAND being most widely used. This is because, as well as being universal, i.e. it can be made to mimic any of the other standard logic functions, it is also cheaper to construct.

Here again is the diagram of an S/R (set/reset) flipflop.

[Figure: the S-R flip-flop diagram again, two cross-coupled NAND gates]

The output of each gate is connected to one of the inputs of the other gate.

The circuit has two active-low inputs marked S and R (‘NOT’ being indicated by the bar above the letter), as well as two outputs, Q and Q-bar. Table 5.2.1 shows what happens to the Q and Q-bar outputs when various inputs are applied to either the S or R inputs.

[Table 5.2.1: the S-R flip-flop’s outputs for the various input combinations, from learnabout-electronics.org]

Each gate here is a NAND gate, which means it has the truth table of an AND gate with a negation applied afterwards.

Below I explain what happens when you apply an input – this is important to understand because Greer thinks an analogous mechanism works in the brain:

Let’s suppose that the initial S and R inputs are 0, which means their negations are 1. (Their negations are what is fed into the flipflop, as you can see in the diagram.)

Suppose you apply a one to the S input. S-bar (its negation) becomes 0. This will make the Q output become logic 1. (Q is considered the value of the flipflop, so at the moment the value is 1. There is also an output coming out of the lower gate, which is supposed to be the negation of whatever Q happens to be at the moment.)

Why did the zero at S-bar lead to a 1 at Q? Because an AND gate outputs a zero if any of its inputs is 0, and this is a NAND gate, so that 0 is inverted to become a 1. The 1 comes out of the upper gate at Q, and also gets fed into the lower gate. Remember we said that R is 0, so R-bar is 1, so we now have two ones feeding in as inputs to the lower gate. An AND of two ones is a one, but this is a NAND gate, so the one gets inverted to a zero, and comes out as Q-bar. That’s good, because we want Q-bar to be the negation of Q. The zero also gets fed back into the upper gate as an input, so along with S-bar the inputs there are a 0 and a 1, and we know that NANDs output a one in this case.

The interesting part now is that returning the S input to 0 has no effect. The pulse has been ‘remembered’ at Q. S-bar is now 1 again, but the other input to the upper gate (Q-bar) is 0, so the NAND output remains at 1.

Q is reset to 0 by applying a 1 to the R input (R stands for reset), which sends a 0 on R-bar into the lower gate.

Even when R returns to 0, the 0 at Q is ‘remembered’.

There are problems with the SR flip-flop. For instance, there is a problematic state in which it can keep flipping its output values. This happens if both inputs are activated together (changing from 0,0 to 1,1) and then released at the same time.

Basically the SR flip-flop is a simple 1-bit memory. Once a pulse at S has set Q to 1, any further pulses at S have no effect on the output.
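As a sanity check on the walkthrough above, here is a tiny simulation (my own sketch, using the convention in this post that the external S and R values are inverted before they reach the NAND gates):

```python
def nand(a, b):
    return 0 if (a and b) else 1

def settle(s, r, q, q_bar):
    # Iterate the cross-coupled NAND gates until the outputs stop changing.
    # s and r are the external inputs; their negations feed the gates.
    for _ in range(10):
        new_q = nand(1 - s, q_bar)
        new_q_bar = nand(1 - r, new_q)
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q, q_bar

q, q_bar = 1, 0                                  # some initial state
q, q_bar = settle(s=1, r=0, q=q, q_bar=q_bar)    # pulse S: Q is set to 1
print(q, q_bar)                                  # -> 1 0
q, q_bar = settle(s=0, r=0, q=q, q_bar=q_bar)    # release S: Q stays 1 ("remembered")
print(q, q_bar)                                  # -> 1 0
q, q_bar = settle(s=0, r=1, q=q, q_bar=q_bar)    # pulse R: Q is reset to 0
print(q, q_bar)                                  # -> 0 1
```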

In Douglas Greer’s model of the cortex, instead of bits being fed into this setup, entire images are passed on the S and the R lines. If the images are ‘reciprocal’, (just like true and false are reciprocal), then an entire image is remembered on the Q output.

You might wonder how images could be reciprocal, but remember that any logical operation can be represented as a truth table, which means you are associating various combinations of values with another value. Neural nets are all about learning associations, so you can create the equivalent of a truth table to make Greer’s memories work. One advantage of these types of memories is that if the image is imperfect, or somewhat noisy, it will converge on the correct image.

  3. Wavelets

Another human-invented technique that the brain may also use is ‘Wavelets’. If you look at an image, it may have repeating areas. For instance, an image of a checkerboard obviously does. There is a certain frequency that the black squares repeat at. It turns out that even less-obvious examples of images can be represented as a combination of frequencies. If you look at a sine wave:

[Figure: a sine wave]

You notice that it has both a frequency and an amplitude. An image can be converted, via a Fourier transform, to a set of coefficients of sine (and cosine) waves at different frequencies. You can plot the coefficients on the Y-axis, and the frequencies on the x-axis. This is “frequency space”. The transform can also be used to recreate the original image, and even clean it up a little. For instance, ‘noise’ is often at higher frequencies than the details of the image, so you could remove some of the high frequencies before recreating the image. Even if the higher frequencies are not due to noise, they might be due to small details that are hard to make out anyway, and you can compress an image by removing them from the Fourier transform before you recreate the image.
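Here is a minimal sketch of that idea using NumPy’s FFT: transform a made-up signal, drop the highest-frequency coefficients, and transform back:

```python
import numpy as np

# A made-up 1-D "image": a slow pattern plus high-frequency noise.
n = 256
x = np.linspace(0, 2 * np.pi, n)
signal = np.sin(3 * x) + 0.2 * np.random.default_rng(0).normal(size=n)

coeffs = np.fft.rfft(signal)           # frequency-space representation
coeffs[20:] = 0                        # throw away the high frequencies
smoothed = np.fft.irfft(coeffs, n)     # recreate the (compressed, denoised) signal

print(np.abs(signal - smoothed).mean())  # how much detail was lost
```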

Fourier transforms have disadvantages. For instance, you are limiting yourself to using waves that go on forever in both directions, and those waves are poor at picking up some types of local features.

Wavelets were invented to make up for the deficiencies. A wavelet (there are many types) might look like:

[Figure: an example wavelet]

This wavelet does not go on forever in both directions; in fact, it can be centered at a particular spot. So you can go through an image with wavelets such as this, trying to fit them at regular intervals. Where they fit well, you will get a large coefficient; where they fit poorly, you will get a small coefficient.

You can then go through the image one more time, but with a dilated form of the wavelet (think of stretching it sideways). In the above example it would cover more space on the image, and the  peaks and valleys would be more spread out.

So with each pass across the image, you have another set of coefficients. For a particular spot on the picture where a wavelet was sampled, given a particular dilation, there will be just one coefficient.
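A rough sketch of that scan-at-several-dilations idea, using a hand-rolled Mexican-hat style wavelet and plain convolution on one made-up row of an image (a toy, not Greer’s implementation):

```python
import numpy as np

def mexican_hat(width, points=101):
    # A Mexican-hat style wavelet: a central bump with a dip on either side.
    t = np.linspace(-5, 5, points)
    return (1 - (t / width) ** 2) * np.exp(-(t / width) ** 2 / 2)

rng = np.random.default_rng(0)
row = rng.normal(size=500)             # one made-up row of an image

# One pass per dilation: where the wavelet fits well, the coefficient is large.
for width in (1.0, 2.0, 4.0):          # progressively stretched (dilated) wavelets
    coeffs = np.convolve(row, mexican_hat(width), mode="same")
    print(width, coeffs.round(2)[:5])
```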

Greer’s idea is that images in the brain can be represented by perhaps parallel layers, each layer corresponding to one dilation of the wavelet. He notes that in your retina, there are neurons that respond when a cell is not firing, but the cells in a circle around it are firing. This can be thought of as a wavelet. He also notes that some of these neurons have a wide receptive field, and others have a smaller one, so the dilation of the wavelet differs.

(It has been shown that different size fields project to different areas, so you have a fine-detail wavelet representation in one layer, and a broader view in another.)

So what is Greer’s theory?

First, on manifolds:

Physical quantities such as the intensity of light impinging on the surface of the retina, the air pressure on the eardrum over time, the position and temperature on the skin surface and the forces over entire muscle cross-sections, are all functions defined on manifolds. We refer to these functions as images or fields.

(Note that an image, in his terminology, is not just a picture taken through a camera, it can be any map of some measure).

Second, on flip-flops:

He notes that a flip-flop…

 can be viewed as a dynamical system with two fixed point attractors that correspond to “0” and “1”. The set of all values that converge toward a particular attractor is referred to as an attractor basin.

We construct a dynamical system analogous to an SR flipflop where the bits are replaced by images and the logic gates are replaced by feed-forward image association processors. The convergence of the system toward a reciprocal-image attractor is then demonstrated.

It is important that images be attractors–over time their components change to converge on one prototype image– for the same reason that computers deal with 1’s and 0’s and not fractional voltages.

One example of mapping between manifolds is from the image on the surface of the retina to a model of three-dimensional physical space. This transformation is the shape extraction problem of computer vision. It has been described as the inverse of computer graphics, which transforms three-dimensional spatial representations to a two-dimensional visual image.

 Another example of a mapping would involve hearing.

Tonotopic maps, which map audio frequencies to locations in the brain, have been documented and studied for some time. Some of these resemble audio spectrograms which plot time in the horizontal direction and frequency in the vertical direction. In the computational manifold approach, a child learning new nouns is learning associations between visual images and audio images.

In motor control, a manifold is involved as well:

Fibers within the same muscle may pull the limb in different directions, but close-by fibers pull in a similar direction

Consequently, the two-dimensional image of the muscle cross-section smoothly maps to a manifold in a tangent bundle that describes the forces exerted on the limb. Images on the surface of the cross-section can be used to describe the efferent nerve signals that control the contraction of the muscle fibers as well as the afferent signals that convey the fiber length and tension.

In other words, one manifold coming into the brain might be the signals (arranged as coming from a surface) indicating the tension of your various muscle fibers. Another manifold, this one coming from your brain to your body, might be the signals directed at controlling your muscle fibers. In his article, he says that the ability to carry out motions (such as grasping an object) given your physical state can be looked at completely as mapping between ‘images’.

Smell also involves a manifold.

The olfactory bulb responds to odors with a spatial map generated by a distributed assembly of specific and general molecular receptors. As a result, different mixtures of odor molecules produce unique odor maps. Moreover, there are general similarities between the retina and olfactory bulb in the cellular circuits producing lateral inhibition and receptive fields, which indicate conserved mechanisms of neural processing in vision and olfaction. This is consistent with an image association model of sensory information processing in a frequency space such as one generated by the continuous wavelet transform.

Though computers use Boolean logic (true/false logic), and operations such as addition and multiplication on binary digits, Greer points out that binary digits are abstract mathematical concepts.

If someone were to take apart a typical desktop computer searching for ones and zeros, they might be disappointed to find only analog resistors and transistors. The digital circuit specifications themselves refer only to acceptable ranges of voltages on the inputs and outputs. These specifications are, in effect, a contract with the digital circuit designer that defines a guaranteed behavior. This concept of equating symbols with behavior can be extended to computational manifold automata.

In the SR flip flop, the circuit acts as a dynamical system with two attractors corresponding to “0” and “1”. The two binary digits are, in effect, voltage ranges where the circuit conceptually falls into one of two “energy wells”. These fixed-point attractors create the stability required for an actual physical realization to operate in the presence of the inevitable noise and transient errors in the inputs.

Greer defines a Λ-map (Λ from the Greek word Logikos) as a feed-forward process that accepts one or more images as inputs and produces a single associated output image.

We can create a circuit analogous to an SR flip-flop by replacing the bits with images and replacing the NAND gates with Λ-maps. The recurrent connections are shown below where the exterior and interior Λ-maps are labeled ΛE and ΛI. We refer to the two outputs as reciprocal images and use the term psymap to refer to the structure containing two recursively connected Λ-maps.

In other words, the Logikos (Λ) map maps one or more images to an output image, and the psymap combines 2 Logikos map building blocks in an analogy to the S-R flipflop.

[Figure 5: a psymap, two recursively connected Λ-maps (ΛE and ΛI), analogous to an S-R flip-flop]

Then Greer says:

Rather than having only two attractors for “0” and “1”, as is the case for an SR flip-flop, a psymap can have any number of attractors. Each reciprocal-image attractor can be constructed from two arbitrarily chosen reciprocal images. Let q = ΛE(p, s) and p = ΛI(q, r) denote the exterior and interior Λ-Maps shown in Fig 5(b) and let Null denote a predefined “blank” image. For an arbitrary collection of image pairs (ai, bi) we can create a new attractor by adding associations to both Λ-Maps so that ΛE(bi, Null) = ai and ΛI(ai, Null) = bi. In addition, we can define associations for the S and R inputs – for example ΛE(X, si) = ai where X is any image – that allow us to force the psymap to the (ai, bi) state.

I think he means that something like the diagram below becomes possible, since Null and b(i) gives a(i), and Null and a(i) gives b(i). Here the images a(i) and b(i) act like reciprocals.

[Figure: a reciprocal-image attractor: Null with b(i) maps to a(i), and Null with a(i) maps to b(i)]
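My own toy reading of those associations (not Greer’s code) would look something like this, with images reduced to labels and the two Λ-maps reduced to lookup tables of learned associations:

```python
# A toy reading of the psymap associations: "images" are just labels here,
# and each Λ-map is a lookup table of learned associations.
NULL = "Null"
lambda_E = {("b1", NULL): "a1"}   # exterior map: ΛE(b1, Null) = a1
lambda_I = {("a1", NULL): "b1"}   # interior map: ΛI(a1, Null) = b1

def psymap_step(q, p, s=NULL, r=NULL):
    # q = ΛE(p, s), p = ΛI(q, r): each output is fed back into the other map.
    new_q = lambda_E.get((p, s), q)
    new_p = lambda_I.get((new_q, r), p)
    return new_q, new_p

q, p = "a1", "b1"
for _ in range(3):
    q, p = psymap_step(q, p)
    print(q, p)   # stays at the reciprocal pair (a1, b1): a fixed-point attractor
```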

Professor Greer sees the various psymaps as connected via a type of bus (see diagram). Each line in the diagram conveys an entire image. Psymaps can obtain images from the bus, or can output images to the bus for other psymaps to pick up.

[Figure: psymaps connected by a bus, where each line carries an entire image]

Neurons connect to other neurons via synapses. If at a particular instant, you look at the neurotransmitter concentration in all these synapses, you have a measure on a map, or a manifold. Each synapse has its own measure on the map of neurotransmitter concentrations. Greer sees each manifold as a state that combined with inputs can make a transition to another state.  In everyday life, a state-transition diagram (for an ATM card) might look like this:

[Figure: an everyday state-transition diagram, for an ATM card]

Greer’s analogy:

 As an informal metaphor, we can imagine a room with a collection of flat-screen displays which are labeled 1 through N. Each display screen is a flat surface, i.e. a rectangular two-dimensional manifold, which we label Mi. The visual image currently being displayed on a screen can be represented by a function fi(x,y) where x and y parameterize the horizontal and vertical dimensions of the screen.

So each screen shows a 2D image, which varies over x and y.

The entire set of images currently being displayed on all screens is the system state.

At the core of an automata is a state transition function which determines the next state based on the current state and any inputs (the inputs would be images as well). The current state is the collection of images currently being displayed on the flat-panel screens. The next state is also a set of images which are uniquely defined by the current-state images and the input images. Since the inputs, outputs and current state are functions, we refer to the mapping between the current state and the next state as the next state transformation Z. Another transformation, T, generates the functions defined on the output manifolds based on the current state and the inputs.

In the diagram below, each PE is a neuron making a bridge between two manifolds. The PEs shown are part of the Logikos maps Λi and Λe. Each PE maps chemical concentrations at its dendrites to chemical concentrations at its axons. The vertical rectangles are manifolds. In this particular diagram, the output is fed back to the input, and the output image is meant to be the same as the input image. There is a time delay in making the trip through the mapping, so if we call the image ‘Q’, then the output is Q(t+1). The output is fed back into the input with the hope that it will converge to the prototype image Q.

[Figure: processing elements (PEs) bridging manifolds in the Λi and Λe maps, with the output image fed back to the input]

Even though the dendritic and axonal trees of each PE cover only limited local areas, the recurrence allows their effect to spread over the entire image. This works because each PE creates a neurotransmitter map that affects several other PEs (neurons) in the opposite Λ-map. These in turn form many local loops by feeding back into the original PE. (Note that in the illustration the arrows are labeled with Λi and Λe; each neuron is part of a Logikos map.)
A Λ-map can accept multiple input images by aligning them topologically, as shown in the next image. One or more of the input images can serve as a control mechanism by regulating how the other images are processed. Image masks that overlay the multiple input images can be used to control the association-formation process and to focus attention on specific regions.
[Figure: a Λ-map accepting multiple topologically aligned input images, with masks focusing attention on specific regions]

So how would Greer’s theory apply to thinking?

For one thing, Greer says that we can view the recognition of unique individuals in the external world as the process of the psymap dynamical system moving toward a unique attractor.

On the topic of human language, he says this:

Words spoken by many different people with a variety of vocal characteristics have the same spelling and meaning. These similar but distinct sounds are all mapped to the same symbol.

So when you hear the word “tomato” in a Scottish accent versus  a southern accent, the word converges in your mind to a prototype that you are familiar with.

There has long been a debate on how to define the border between concepts. He says that symbols can be defined either as sets or as attractors. (For instance, for the symbol “chair”, you might list everything you can think of that fits the definition, and that would be a set.)

These definitions are related, but while using sets may be easier initially, it is more difficult to maintain in practice…

As the dimension of the state space is increased, chaotic attractors are more likely to be created [5]. The boundaries between these attractors will be fractal, so while the sets corresponding to the attractor basins can be defined in theory, in practice they are impossible to precisely locate.

Imagine a high-resolution photograph of a room containing assorted items. If several people were asked to describe the room, their narratives would have significant differences in the words used and the grammatical constructions.

Consequently, it is difficult to define the photograph in terms of sets. However, if we consider words to be CMA attractors in a dynamical system, then we can describe the process as each person having a “visual mask” that changes in shape and moves around the photograph of the room, bringing various items into “focus”. The resulting image portion may come close to a reciprocal-image attractor that is associated with a distinct word. The images that form the system state then move toward that attractor, eventually causing that word to be spoken. In this approach, there is no need to define the actual sets. The symbols are defined by the system behavior….

Letters are visual glyphs that are associated with sounds. When we read, we not only understand the meaning of the words, but we instinctively know when the words rhyme. The words may evoke mental images that are then transformed back into other sounds. 

The presence of reciprocal-image attractors in connected psymaps can generate a “recognition cascade”. In one psymap, a small portion of an image moving near an attractor can generate a sequence of images moving toward that attractor. Its outputs may then cause the state of other psymaps to move toward related attractors. In this way, the slightest “hint” may evoke an intricate and detailed recollection.

The above is a summary of just one of Greer’s articles.  If you look at his patent on his website (gmanif.com), you will see more of his theory.   There are interesting implications of looking at the brain as a manipulator of maps.

Sources:

  1. The Computational Manifold Approach to Consciousness and Symbolic Processing in the Cerebral Cortex – Douglas S. Greer
  2. Patent: METHOD OF GENERATING AN ENCODED OUTPUT SIGNAL USING A MANIFOLD ASSOCIATION PROCESSOR HAVING A PLURALITY OF PAIRS OF PROCESSING ELEMENTS TRAINED TO STORE A PLURALITY OF RECIPROCAL SIGNAL PAIRS

(Both sources can be found at Professor Greer’s company website: http://gmanif.com/.)

  3. You can obtain his free source code at: http://gmanif.com/contact.html#Sapphire
  4. S-R flip-flops: http://www.learnabout-electronics.org/Digital/dig52.php

Ogma Corp’s Feynman Machine – it learns Attractors and can drive a car.

In a previous post, I wrote about Numenta, a company that attempts to model just one region of the cortex. Since the cortex looks very similar whether it is used for hearing, seeing, or other purposes, they believe there is a general basic algorithm that is used everywhere, and that it makes sense to study the building block first.

Numenta makes its theory (Hawkins’ Hierarchical Temporal Memory – HTM) and algorithms available to the public.

In an article titled “Symphony from Synapses: Neocortex as a Universal Dynamical Systems Modeller using Hierarchical Temporal Memory”, Fergal Byrne described how a stack of HTM layers could form a module corresponding to a brain region. This region has feedback; in other words, a signal from its output layer sends a branch to its input layer, and so the region becomes a “dynamical system”. Dynamical systems can exhibit behaviors such as chaos, and can have attractors. Attractors can be thought of as basins in a landscape which, once entered, cannot be exited. Some attractors are much more complicated than basins, but there is still a bounded area that a point traveling on that landscape can enter but not leave.

[Figure: examples of simple point attractors on the left, a “strange” attractor on the right]

One HTM layer learns synaptic connections between neurons firing at a time (t), and a time (t+1). So the neurons learn to predict what will happen next. In a real cortex, we find that some regions detect only simple features (such as edges, corners and so forth), and others, further up in a hierarchy, detect complex combinations of features such as faces. So in a lower region, it may help prediction to get feedback from a higher region – for example, if you know you are looking at a pyramid, then at a lower level that knowledge may help predict that if you view the pyramid from the top, you will see lines converging to a point.

So Fergal and others designed a new hierarchical machine, which they called the “Feynman Machine”, and they formed a company named Ogma (their website is ogma.ai) to explore its potential. The Feynman Machine did not use HTM’s model of regions; instead it used “k-sparse autoencoders” as its building blocks, but the general idea was similar.

I will give a very general explanation of what they did here.

But first let’s go back to the article Symphony From Synapses, which was written before the Feynman Machine was created, and which uses HTM regions in its initial theory. Here Fergal starts off by explaining that our world is a world of dynamical systems, systems that are constantly changing and that produce a changing flow of information at our eyes, ears and skin.

In mathematics, some such systems are modeled as having an output that feeds back into the input. For instance, if you were modeling the reproduction of rabbits, you might start out with 2 rabbits, which you calculate should be able to have 10 offspring, and then you feed the number of rabbits (10) back into the equation to get perhaps 50 offspring in the next generation. You can plot the resulting rise in population, which does not go on forever, due to constraints such as limits on food, water, or shelter. In the diagram below (on the right) you see an equation that describes the growth of a population. You can see it levels off after some rapid growth.

[Figure: logistic growth of a rabbit population, which levels off after rapid growth, and the equation describing it]
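The standard toy version of this kind of feedback is the logistic map; here is a minimal sketch of iterating it (my example, not one from Fergal’s article):

```python
# Logistic-map sketch: the output population is fed back in as the next input.
r = 2.5            # growth rate (below the chaotic regime, so the curve levels off)
population = 0.02  # population as a fraction of the maximum the habitat supports

for generation in range(20):
    population = r * population * (1 - population)   # feedback: output becomes input
    print(generation, round(population, 4))           # rises quickly, then levels off near 0.6
```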

Some dynamical systems are based on a few equations that depend on each other. Here is an example of a path (trajectory) traced out near a “strange attractor”:

[Figure: a trajectory tracing out a strange attractor]

One characteristic of chaotic systems is that if you start your calculation with a number that is just slightly different than another, the paths they take eventually become very different.  (In this picture, if you think of a ball sliding on a wire, the new ball’s start point might be slightly off the  wire (on its own wire), and though it would travel for a while close to the other ball, eventually it would diverge dramatically).

The above picture shows 3 dimensions, but a Dutch mathematician named Floris Takens found that if you just sample one variable at intervals enough times, you capture the information relevant to you about the essential behaviors of the model.

For instance, sampling one variable (x) of this 3D chart can do this:

[Figure: a 3D trajectory (top) and its reconstruction from delayed samples of the single variable x (bottom)]

The bottom chart is not charting X,Y and Z, because all you have is X.  It is actually plotting X against past versions of X in the series you obtained.   And in important respects, it models the original.
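A minimal sketch of that delay-embedding trick (plotting x against past versions of itself) might look like this, using a made-up signal:

```python
import numpy as np

def delay_embed(x, dim=3, lag=5):
    # Stack x(t), x(t - lag), x(t - 2*lag), ... as the coordinates of one point per time step.
    rows = [x[i * lag : len(x) - (dim - 1 - i) * lag] for i in range(dim)]
    return np.column_stack(rows[::-1])

# A made-up observable sampled over time (in practice, one variable of the real system).
t = np.linspace(0, 50, 2000)
x = np.sin(t) + 0.5 * np.sin(2.2 * t)

embedded = delay_embed(x, dim=3, lag=20)
print(embedded.shape)   # each row is one point of the reconstructed trajectory
```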

So Fergal suggests that our brains, not having access to all the variables that affect what we perceive, do this type of limited sampling, and from it reconstruct real-world dynamical systems.

One model in our brain can be coupled to another model, and influence it, and eventually via our motor actions, control real dynamical systems (like an arm pitching a baseball).

So the brain would be capturing rules, just like the rules that generated  the strange attractor in the  illustration, and not just learning a sequence such as links between still-frames of a movie.

One advantage of a model is that you can run a simulation forward in time to perform forecasting.  If the simulation is incorrect, you can dismiss the difference as ‘noise’ or change the shape of the “landscape” (perhaps the equivalent of changing constants in your implicit equations).

HTM models the cortex as having only a small percentage of its neurons firing at any time. This is a sparse distributed representation (SDR) of whatever inputs are coming in.
As the inputs change, the SDRs also change, so you can think of a sequence of SDRs in time. If the layer being modeled has 2048 neurons (which is typical in HTM implementations), it generally has 40 neurons on at any particular time (though which particular neurons are on is constantly changing), so we can think of an SDR as a single point traveling in a 2048-dimensional space. (For that point, all but 40 of the dimensions have a value of zero at any time; the others are non-zero.)
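For concreteness, here is a sketch of such an SDR as a vector, using the 2048/40 figures mentioned above:

```python
import numpy as np

n_cells, n_active = 2048, 40
rng = np.random.default_rng(0)

def random_sdr():
    # A sparse distributed representation: 40 of 2048 cells active.
    sdr = np.zeros(n_cells, dtype=np.uint8)
    sdr[rng.choice(n_cells, size=n_active, replace=False)] = 1
    return sdr

a, b = random_sdr(), random_sdr()
print(a.sum(), b.sum())      # 40 active cells each
print(int((a & b).sum()))    # overlap between two SDRs (a measure of their similarity)
```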

In the model below, the sensory inputs come into layer 4 (layer 1 is closest to the surface of the cortex and layer 6 is the deepest, but the sensory info doesn’t go to 1 or 6 directly).

[Figure: the layers of a cortical region in the multilayer HTM model]

The illustration shows flows of inputs and control signals in Multilayer HTM. Sensory (red) inputs from thalamus and (blue) from lower regions flow to L4 and L6. L4 models inputs and transitions, L2/3 temporally pools over L4, passing its outputs up. L5 integrates top-down feedback, L2/3 outputs, and L6 control to produce motor output. L6 uses gating signals (orange) to co-ordinate and control inputs and outputs, and can execute simulation to run dynamics forward.

Layer 4 learns correlations between successive SDRs.  The successive SDRs could be formed in response to you moving your eyes from point to point on an object until you recognize it. If SDR ‘a’ usually comes after SDR ‘q’, then links will strengthen between the neurons of the patterns, so that ‘a’ begins to be predicted when you experience ‘q’.

Then a subpopulation of neurons in Layer 2/3 of cortex performs a function known as Temporal Pooling, representing sets of successively predicted SDRs in Layer 4 as a single, stable output. For instance, if the SDRs coming into Layer 4 represent different observations by you of a chair from different angles, the representation in Layer 2/3 might stay stable – if so, it would represent the concept “chair” while of course the representation in Layer 4 keeps changing. So the Temporal Pooling SDR can be seen as a kind of dynamical symbol for the sequence of SDRs currently being traversed in L4.  (You could also think of Layer 2/3 as learning the constants in the equations that are governing the behavior in Layer 4, though the constants are more like slow-changing variables).

L4 can be thought of as learning an attractor. L2/3 is also learning an attractor, and if, for instance, you shift your eyes suddenly from one object to another, the information coming from L4 to L2/3 is now so different that L2/3 changes its own representation. This can be viewed as L2/3 shifting its SDR trajectory from one basin of attraction to another. By definition, if a trajectory enters an attractor it cannot leave, but if you change the governing constants of the implicit equation, you get a different path, with different attractors.

So now, finally, to the “Feynman Machine” and what it’s been used for so far:

Ogma’s Feynman Machine is a hierarchy of nonlinear dynamical systems. There is plenty of feedback, so in each region outputs influence inputs, and higher regions also send feedback to lower regions. Outputs that influence inputs are a feature of dynamical systems that we saw even in our simple example of the multiplying rabbits.

It has been shown that a number of discrete and hybrid dynamical system designs can simulate any Turing machine. A Turing machine is a very simple machine with simple rules, but it has the power of universal computation, as does the laptop on your desk. So why use dynamical systems instead of regular computer algorithms?

Fergal gives an example as follows:

A soccer player who runs into the box to head a crossed ball into the net is clearly not solving the simultaneous differential equations of a spinning ball’s motion through moving air, under gravity, nor is his run the result of preplanning a sequence of torques generated by his muscles. The player’s brain has a network of dynamical systems models which have, through practice and experience, learned to predict the flight of the ball and plan a sequence of motor outputs which will, along with intermediate observational updates and corrections, lead to the desired performance of his skill.

In the Feynman machine, each building block is a paired decoder and encoder.    A sensory input might be encoded, and then decoded back to a prediction of what it will be next, and the next signal coming in will be compared with the prediction.   A diagram of their machine looks like this:

[Figure: the Feynman Machine’s hierarchy of paired encoders and decoders]

Each encoder/decoder pair is a non-linear dynamical system. The encoder represents its input as an SDR with a limited number of cells firing, and the decoder uses an algorithm to learn a prediction of the next input SDR (at  time t+1), combining information from a signal coming down from higher up in the hierarchy, with the representation of the SDR. (The aspect of the output that feeds back into the input is the error signal between the decoder prediction for time (t+1) and the actual input pattern coming into the encoder.) 
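Very roughly, one such building block might be sketched like this: a toy k-sparse encoder plus a linear decoder that predicts the next input and learns from its prediction error. The sizes and the delta-rule update are my own simplifications, not Ogma’s actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, k = 32, 64, 8

W_enc = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_dec = rng.normal(scale=0.1, size=(n_in, n_hidden))

def encode(x):
    # k-sparse code: keep only the k most strongly driven hidden units.
    drive = W_enc @ x
    sdr = np.zeros(n_hidden)
    sdr[np.argsort(drive)[-k:]] = 1.0
    return sdr

def predict_next(sdr):
    # Decoder: predict the *next* input from the current sparse code.
    return W_dec @ sdr

x_t = rng.normal(size=n_in)        # current input
x_next = rng.normal(size=n_in)     # the input that actually arrives next

code = encode(x_t)
prediction = predict_next(code)
error = x_next - prediction                 # this prediction error is what gets fed back and learned from
W_dec += 0.01 * np.outer(error, code)       # simple delta-rule update of the decoder
print(np.abs(error).mean())
```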

Ogma sees a use for its program in anomaly detection (which Numenta’s algorithm is also used for).   If you follow a sequence, let’s say of credit card transactions, and then are surprised by an anomaly in the sequence that conflicts with predictions, you might be detecting a fraudulent  transaction. Other applications of anomaly detection include stroke prediction, monitoring of dementia sufferers,  monitoring of industrial plant and equipment, energy production, vehicle maintenance, counterterrorism, etc.

Ogma has also hooked up their architecture to deep learning modules, and even to a radio-controlled self-driving car model with a video camera attached. You teach the car by guiding it down a few paths.

It’s not easy to understand what is actually going on in a dynamical system in the brain. The connections between neurons, and the signals going up and down, are mysterious. There has been some work in trying to understand the flow:

One paper, by Giovanni Carmantini et al (see sources), implements a Turing machine as a dynamical system. You can put the known rules of the Turing machine into the dynamical system. To quote the article: “the synaptic weight matrix (of the dynamical system) is explicitly designed from the machine table (of rules) of the encoded automaton”, and “the derived approach could bring about the exciting possibility of a symbolic read-out of a learned algorithm from the network weights.”

So, at least in their architecture of a dynamic system, you do know what is going on under the hood.

Another paper, this one by a neurobiologist named Dean Buonomano, shows that any set of neurons that includes recurrent connections will have its own preferred trajectories, which are mostly chaotic, but he can force the neurons to learn one of their preferred trajectories to the point that it is not chaotic any more (for that trajectory). So if you have a sequence of firing patterns of the neurons, even if you perturb that sequence with noise, it will snap back to the sequence. This means that chaos has been tamed, and you don’t have the problem any more of points at very similar locations diverging dramatically. The trajectory of firing patterns is now predictable, and the region can be used to write words on a screen, if its outputs are trained to do so. His paper is also interesting because he lists some techniques that are available to understand the internals of such systems of neurons. Most of those techniques are statistical, such as looking at the distribution of weights, so while they show some significant changes after learning, you still don’t quite understand how the information is coded and how it is used.

But Ogma’s work is exciting, as is Numenta’s, and I’m sure we can expect some major advances by both companies.   Both give you their source code, free, to experiment with.

Here is one last example of what Ogma’s software can do: the lower row is a learned sequence of predicted video frames, based on training on the video shown in the top row.

[Figure: top row, frames from the training video; bottom row, the frames predicted by Ogma’s software]

Sources:

Feynman Machine: The Universal Dynamical Systems Computer by Eric Laukien, Richard Crowder, Fergal Byrne (https://arxiv.org/abs/1609.03971)

Symphony from Synapses: Neocortex as a Universal Dynamical Systems Modeler using Hierarchical Temporal Memory – by Fergal Byrne (https://arxiv.org/abs/1512.05245; most of the pictures in this blog post come from that article)

Robust timing and motor patterns by taming chaos in recurrent neural networks by Rodrigo Laje & Dean V Buonomano (available on internet)

A modular architecture for transparent computation in recurrent neural networks by Giovanni S. Carmantini, Mathieu Desroches and Serafim Rodrigues (available on internet)

The “Tunnel vision” brain, and the “Big picture” brain.

Iain McGilchrist is a former psychiatrist living on the Isle of Skye, in Scotland. He wrote a book about the two cerebral hemispheres in 2009, titled “The Master and his Emissary“. In that book he describes the two ways of thinking of the two hemispheres as complementary and both necessary.   He believes that the right hemisphere shows “the big picture”. If you have ever heard the saying “To miss the forest for the trees”, that would describe the left hemisphere.
Here are some of the differences between the left and right hemispheres that he describes.

[Photo: Iain McGilchrist]
  1. The right hemisphere has more “white matter”. This is explained by the axons (communication cables leading from each neuron) traveling greater distances, which makes sense for a “big picture” view.
  2. Neurochemically the hemispheres differ in their sensitivity to hormones (for example, the right hemisphere is more sensitive to testosterone). They depend preponderantly on different neurotransmitters (the left hemisphere is more reliant on dopamine and the right hemisphere on noradrenaline).
  3. Novel experience induces changes in the right hippocampus, but not the left. This squares with the idea that the right hemisphere is attuned to new experiences.
  4. The right hemisphere outperforms the left when prediction is difficult. Also, as far as learning new skills, once skills are familiar, they shift to being the concern of the left hemisphere. “The left hemisphere prefers what it knows”, he writes.
  5. The right hemisphere is more capable of a frame shift; and not surprisingly is important for flexibility of thought. If the right frontal lobe is damaged, you get “perseveration”, a pathological inability to respond flexibly to changing situations. For example, having found an approach that works for one problem, subjects seem to get stuck, and will inappropriately apply it to a second problem that requires a different approach.
  6. In problem solving the right hemisphere presents an array of possible solutions, which remain live while alternatives are explored. The left hemisphere, by contrast, takes the single solution that seems best to fit what it already knows and latches onto it.
  7. V. S. Ramachandran’s studies of anosognosia reveal a tendency for the left hemisphere to deny discrepancies that do not fit its already generated schema of things. The right hemisphere, by contrast, is actively watching for discrepancies, more like a devil’s advocate. These approaches are both needed, but pull in opposite directions.
  8. The right hemisphere takes whatever is said within its entire context. If someone says to you “it’s a bit hot in here today”, your right brain understands he is hinting you should open a window. Your left brain just thinks he has uttered a random observation about the temperature. As you might expect, the right hemisphere underpins the appreciation of humor, since context is important for jokes.
  9. Insight, the sort of problem solving that happens when we’re not concentrating on it, is associated with activation in the right hemisphere. Insight is also a perception of the incongruity of one’s previous assumptions, which links it to the right hemisphere’s capacity for detecting anomalies.
  10. The left hemisphere is the hemisphere of abstraction. Abstraction is an act of removing from context, and making a general concept.
  11. The left hemisphere classifies, while the right hemisphere identifies individuals.
  12. Functional imaging of the brain shows that the left hemisphere takes a “God’s eye” or invariant view of objects, while the right hemisphere uses stored “real world” views.
  13. It is the left hemisphere alone that codes for nonliving things while both hemispheres code for living things. If your right hemisphere is damaged, you can still use tools, but if your left hemisphere is damaged, you can’t even use a key in a lock, or a hammer with a nail.
  14. The right hemisphere plays an important role in what is known as ‘theory of mind’, a capacity to put oneself in another’s position. It reminds me of the supposed ancient Indian saying “Don’t judge a man, until you have walked a mile in his moccasins”.
  15. The right hemisphere identifies emotional expression in the face and in the tone of voice (vocal intonation, or prosody). The left hemisphere reads emotions by reading the lower part of the face – the mouth, rather than the eyes. “A patient of mine with a right temporoparietal deficit asked me ‘What’s all this with the eyes?’. When I asked what she meant, she explained that she had noticed people apparently communicating coded messages with their eyes, but could not understand what they were.”
  16. The right frontal lobe is much more important in most emotional expression, but is not superior to the left in one emotion – anger.
  17. Left anterior lesions are associated with depression, right anterior lesions associated with ‘undue cheerfulness’. ‘anterior’ means front, in this case, and in truth, things are more complicated, because contradictory things happen as you move back in the lobes. Confirming this observation: if a patient with depression responds to antidepressants, the left anterior lobe starts functioning better.

The corpus callosum is a cable of nerve fibers that connects the two hemispheres, with both excitatory and inhibitory connections.

The corpus callosum both inhibits and excites, and Iain explains that as follows. Co-operation requires difference, not more of the same. “It is not cooperation for the surgeon and the assistant both to try to make the incision…Think of the two hands of a pianist – they must cooperate, but they must also be independent.” Or think of two singers harmonizing together.

It is interesting that there are children who only have one hemisphere, and grow up to be normal.  Relevant to that might be this fact:

The left hemisphere inhibits the right hemisphere on some tasks. The inhibition signal is sent via the corpus callosum. If that inhibition goes down (perhaps due to a distracted left hemisphere), we find the right hemisphere can do things that we normally expect the left to be better at, for instance, understanding abstract words.

V. S. Ramachandran (he is the director of the Center for Brain and Cognition at UC San Diego) has used the notion of layered belief — the idea that some part of the brain can believe something and some other part of the brain can believe the opposite (or deny that belief) — to help explain anosognosia. In a 1996 paper, he speculated that the left and right hemispheres react differently when they are confronted with unexpected information. The left brain seeks to maintain continuity of belief, using denial, rationalization, confabulation and other tricks to keep one’s mental model of the world intact; the right brain, the “anomaly detector” or “devil’s advocate,” picks up on inconsistencies and challenges the left brain’s model in turn. When the right brain’s ability to detect anomalies and challenge the left is somehow damaged or lost (e.g., from a stroke), anosognosia results. For instance, Ramachandran tells this story:

I saw a lady, not long ago, in India, and she had complete paralysis on her left side… I said, “Can you move your left arm?”  She said, “Yes.”  “Can you touch my nose?”  “Yes, I can touch your nose, sir.”  “Can you see it?” “Yes, it’s almost there.”  The usual thing, O.K.?  So far, nothing new.  Her left arm is lying limp in her lap; it’s not moving at all; it’s on her lap, on her left side, O.K.?   I left the room, waited for a few minutes, then I went back to the room and said, “Can you use your right arm?”  She said, “Yes.”  Then I grabbed her left arm and raised it towards her nose and I said, “Whose arm is this?”  She said, “That’s my mother’s arm.”  Again, typical, right?  And I said, “Well, if that’s your mother’s arm, where’s your mother?”  And she looks around, completely perplexed, and she said, “Well, she’s hiding under the table.”  So this sort of confabulatory thing is very common, but it’s just a very striking manifestation of it.  No normal person would dream of making up a story like that.

maxresdefault

Blogger’s opinion: After reading the chapter, I think of the mindset of ideologues – they come across as very self-righteous, very convinced of a narrow belief system, rejecting large contradictions or small anomalies. They are not good at detecting double standards. They don’t appreciate certain types of humor, at least not the kind that points up their own contradictions. It is as if their right hemisphere has gone AWOL in some respects. It is interesting that several comedians don’t want to perform on our increasingly politicized American campuses anymore.

Sources:
The Master and his Emissary – Iain McGilchrist, 2010 paperback edition, Yale University Press.
https://opinionator.blogs.nytimes.com/2010/06/23/the-anosognosics-dilemma-somethings-wrong-but-youll-never-know-what-it-is-part-4/
http://www.breitbart.com/video/2015/06/04/seinfeld-comedians-tell-me-dont-go-near-colleges-theyre-so-pc/

Getting past the “Deep Learning” neuron

If you learn about machine learning methods such as “deep learning”, you learn a neural model that does a remarkably simple calculation – it receives inputs from many sources, multiplies each input by a weight, takes a sum of the products, subtracts a threshold, and applies a function.
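
To make that calculation concrete, here is a minimal sketch in Python (the inputs, weights, and threshold are just illustrative numbers I chose):

```python
import numpy as np

def classic_neuron(inputs, weights, threshold):
    # weighted sum of inputs, minus a threshold, passed through a squashing function
    total = np.dot(inputs, weights) - threshold
    return 1.0 / (1.0 + np.exp(-total))   # sigmoid activation

# three inputs, three weights, and a threshold (all values made up for illustration)
print(classic_neuron(np.array([0.2, 0.9, 0.1]),
                     np.array([0.5, -0.3, 0.8]),
                     threshold=0.1))
```
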
That model has led to amazing progress in voice recognition, scene recognition, and other areas.
But years have gone by, and we now know more about neurons – in particular, that each dendrite branch can detect sequences of inputs in time.

If neurons 1, 2 and 3 all synapse on one dendrite branch of neuron 4, it is possible to show that neuron 4 will respond to the sequence (for example) 2,3,1, and not to 1,2,3 or 3,2,1.  So neuron 4 might recognize a simple musical melody on each branch (assuming it had enough synapses on each branch from neurons in the auditory areas).

Since a neuron has many dendrites, and a dendrite can have several branches, and each branch can recognize a sequence, you have a computational unit now that can (and I’m sure will) be the basis of new and different types of neural networks.

johnnyneuron

Here are the ingredients of the recipe that allows a dendrite branch to recognize a sequence. (In the diagram above, synapse ‘1b’ is on a different branch than synapse ‘1’.)

The following points come from Sergey Alexashenko’s book “Cortical Circuitry”.

We now know that

1. The neuron has a new type of spike: up to a point, as you increase the number of inputs arriving at one spot on a dendritic branch, the inputs are summed linearly, just as the traditional theory would suggest. But once a certain threshold is reached, there is a spike in the local voltage. That spike in dendrites resembles a neuronal spike – inputs are summed until they reach a threshold, which triggers a spike. Additional inputs do not significantly increase the magnitude of the response beyond the spike amplitude. The important point here is that we are not talking about the regular neural spike that travels down the axon. That spike of course exists, but we are talking about spikes in dendrites.  These are also called NMDA spikes.

2. The threshold for a spike depends on time: if the inputs impinging on a dendrite via synapses are clustered in time, fewer are required to trigger an NMDA spike than if they are spaced apart in time. This is called ‘cooperativity’, and it works only within one branch of one dendrite. That means that if normally 10 synaptic inputs are needed to trigger an NMDA spike, a recent nearby NMDA spike can lower that threshold to 8 synaptic inputs.

3. Spatial proximity affects the likelihood of a dendrite spike: if for example, you have 10 inputs to the same point on a dendritic branch maybe that would trigger an NMDA spike, but 20 inputs are required if the inputs (synapses) are distributed along the length of the dendrite.

4. The threshold for making a dendrite spike varies along the length of the branch: For instance, it was found that the threshold for initiating a dendritic spike increases 5-fold from the tips of dendritic branches to the parts of the branches close to the soma. But the effect of the spike on the electrical charge of the soma increases 7-fold in the same direction. In other words, spikes that are close to the soma are a lot harder to trigger, but have a stronger effect on the cell body and so are more likely to set off an action potential.

5. The time window within which inputs can reinforce prior inputs varies spatially: temporal summation increases as you move along the dendrite away from the soma – inputs close to the soma have a stronger effect when they are synchronized, but inputs on the tips of dendrites can be summed without loss over relatively long periods of time.

So how does this lead to sequences? For one thing, neurons are more likely to fire when inputs arrive in a sequence approaching the soma, rather than when they arrive in a sequence moving further away from the soma. That makes sense – impulses arriving far away from the soma take more time to travel to the soma, so if impulses arrive on a dendritic branch in an order approaching the soma, they will arrive in a synchronized fashion, thus making peak voltage higher and increasing the probability of the neuron firing.
But apart from that, look at the picture of the neuron above: if an input comes into synapse 1, then synapse 2, then synapse 3, with each input triggering an NMDA spike, the neuron is more likely to fire an action potential that travels down the axon and causes neurotransmitters to impinge on other neurons. Looking at a dendrite branch in the illustration, we see that the threshold out at point 1, which is far from the soma, is lower than the threshold at point 3. So it is relatively easy to fire a dendrite spike at point 1. That spike will make it easier to trigger another spike at point 2, and the spikes at points 1 and 2 will finally make it easier to trigger a spike at point 3. Remember that the dendrite spike at point 3 has a much larger impact than a spike at point 1 for getting the neuron as a whole to fire.
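
Here is a toy Python sketch of that ordering effect – not Alexashenko’s actual model, just the ingredients above with made-up numbers: thresholds rise toward the soma, a local spike lowers the next threshold (‘cooperativity’), and spikes near the soma count more toward firing the cell:

```python
import numpy as np

# Toy model of one dendritic branch with three synapses.
# Index 0 is at the tip (far from the soma), index 2 is near the soma.
thresholds = np.array([1.0, 2.0, 3.0])   # made-up: harder to spike near the soma
soma_weight = np.array([1.0, 2.0, 3.0])  # effect of each local spike on the soma
input_strength = 2.0                     # drive delivered to whichever synapse is active
cooperativity = 1.5                      # how much a recent spike helps later spikes

def soma_drive(order):
    """Total drive reaching the soma when synapses fire in the given order."""
    boost, drive = 0.0, 0.0
    for syn in order:
        if input_strength + boost >= thresholds[syn]:   # local NMDA-style spike?
            drive += soma_weight[syn]
            boost += cooperativity                      # makes the next spike easier
    return drive

print(soma_drive([0, 1, 2]))  # tip -> soma: every spike triggers, largest drive
print(soma_drive([2, 1, 0]))  # soma -> tip: the first, hardest threshold fails
```

Running it, the tip-to-soma order produces a much larger drive on the soma than the reverse order, which is the sequence selectivity described above.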

Sergey explains all this, and gives an example, in chapter 4 of his book.  He also has a model of how the cortex works in the later chapters of the book.

Sources:
Alexashenko, Sergey. Cortical Circuitry (Kindle Locations 405-415). Cortical Productions. Kindle Edition.
A company that already has a product based on neurons that detect sequences is Numenta. Their approach works differently from the model Sergey talks about, but their site (numenta.org) is a good one to visit and is on a research frontier that you can participate in.

Learning about the mind from Stephan Guyenet’s “The Hungry Brain”

The Hungry Brain is a new (2017) book by Stephan Guyenet on how to “outsmart” the instincts that make us overeat. “Outsmart” is a good word, because he doesn’t advocate pure willpower as the answer. There were a few observations that I found of special interest.

1) Leptin is a protein that’s made in the fat cells, circulates in the bloodstream, and goes to the brain. It is a way of telling you that you are not underweight. Some people have leptin deficiency:

Unlike normal teenagers, those with leptin deficiency don’t have much interest in films, dating or other teenage pursuits. They want to talk about food, about recipes.

Starvation has a similar effect in normal people. Patients who were put on a semi-starvation diet for months were hungry of course, but also:

their conversations, thoughts, fantasies and dreams revolved around food and eating. They became fascinated by recipes and cookbooks. Their mental lives gradually began to revolve around food.

This indicates to me that people’s drives and emotions at any particular point in time drive thoughts and interests and not just the other way around.

2) Stephan also talks about the finding that in obesity the hypothalamus is inflamed, and he says that this made sense.

Previous research had already implicated chronic inflammation in insulin resistance – a condition in which tissues like liver and muscle have a harder time responding to the glucose-controlling hormone insulin – and this process had already been linked to increased diabetes risk.

So it wasn’t a major leap to suppose that in inflammation there is resistance to leptin as well, and therefore more leptin would be needed to inform the hypothalamus that the body is not starved, and therefore more fat cells (or bigger fat cells) would be needed to create that leptin.
Even worse, the inflammation (and other damage) in the hypothalamus is itself caused by eating too much fattening food (perhaps by putting so much leptin in the blood that it overwhelms the leptin receptors in some way).
So there is a positive feedback loop in obesity.
This means that people who already eat too much are harming the part of the brain that would normally say “enough”, so that their normal weight set point increases (like the set point of a thermostat being turned up).

I asked the author via email why, if there is damage to the hypothalamus, there isn’t damage to other parts of the brain.
He gave this idea as a possibility:

Parts of the hypothalamus (the parts that get inflamed) have a leaky blood-brain barrier, presumably because the hypothalamus is designed to sense the metabolic state of the rest of the body. So anything that’s circulating in the bloodstream can impact the hypothalamus more than other parts of the brain, e.g. if a person overeats and experiences excessive circulating levels of nutrients.

I also asked him if a damaged hypothalamus would affect other processes in the body. He gave this interesting answer:

The hypothalamus regulates many things, and many are linked to energy status. Four of them that are altered in obesity are blood glucose regulation, sexual maturity onset, reproductive function, and blood pressure regulation. There is some evidence that the neuronal changes that accompany obesity can contribute to poor blood glucose regulation. Sexual maturity onset is linked with leptin levels and this probably explains why puberty onset has been getting earlier lately as the population has been gaining fat. …There is evidence that obesity-related hypertension is caused by excess leptin acting in the hypothalamus, so that could be an indirect effect of hypothalamic inflammation (hypothalamic inflammation -> leptin resistance -> fat gain -> high leptin -> hypertension).

An acquaintance of mine who was dramatically overweight had an operation to make his stomach smaller, and not only did he lose weight, but his diabetes was cured. I found this hard to believe at the time, but after reading The Hungry Brain, it starts to make sense.

51qskvwzxal-_sx342_ql70_

A free program from Northwestern University that understands your drawings

CogSketch is a program that understands sketches (the kind people draw with pen and paper). You can download it for free from the Qualitative Reasoning Group at Northwestern University.
CogSketch is an impressive achievement, one reason being that people draw things in many different ways, and make mistakes (which CogSketch can often advise them on how to correct).

Suppose a teacher uses CogSketch to draw a solution to an exercise, such as drawing the layers of the Earth. For example, he might draw three circles to start with and label the innermost circle “Inner Core” and the outermost circle “Crust.” He would also provide labels for the other parts, along with distractors (here, “Lava” and “Rock”).

Fig. 1.  

concentriccircles

CogSketch then automatically constructs a variety of relationships between the visual entities.

For education, having students explicitly label their sketches with their intended concepts is important since an expert seeing an unlabeled student sketch like Fig. 1 might think it is correct because he or she is interpreting the circles differently from the student. In drawing the layers of the Earth, for example, the sketch of a student who swapped the mantle and crust would look the same as the sketch of a student who got it right: Both sketches consist of a set of concentric circles. Conceptual labels allow for correcting this kind of error.

Humans see objects in a qualitative way.  For instance, rather than saying the ceiling is 20 feet above the floor, we just say it is high above the floor.  When CogSketch interprets sketches, it relies on qualitative judgements, including topology.  (Two objects are said to be topologically equivalent if one can be elastically deformed into the other. For instance, a clay doughnut could be reshaped into a coffee cup without tearing the clay apart in any way.)

CogSketch also recognizes hierarchies of items that make up a picture (for instance, after recognizing edges, it can recognize the object the edges are part of).

To give an idea of what is involved just in interpreting edges (the lowest level of the hierarchy), here is a table of edge properties (I have abbreviated it):

Edge Attributes:
•      Length: Tiny, Short, Medium, Long
•      Curvature: Straight, Curved
•      Arc length: MinorArc, Semicircle, MajorArc, Ellipse
•      Orientation: (several, not given here)

Edge Relations:
•      Adjacency: connected, intersect, intersects
•      Relative orientation: parallel, perpendicular, collinear
•      Positional: rightOf, above
•      Cycle angles: convexCorner, concaveCorner
•      Adjacent corner relations: cycleAdjacentAngles, acuteToObtuse, obtuseToAcute

CogSketch uses a program called the Structure Mapping Engine (SME) to find important differences between the solution and the student sketch, and to offer advice based on those differences. SME takes as input the solution sketch and the student sketch and creates a mapping between the two. Each mapping consists of
(1) correspondences, which indicate how things in the solution correspond to things in the student sketch;
(2) candidate inferences, which indicate important differences between the solution and the student sketch.

The correspondences and candidate inferences are used to determine what advice should be given to the student.
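
To give a feel for those two outputs, here is a deliberately tiny Python sketch. It is not the real Structure Mapping Engine, which matches relational structure analogically; in this toy, the ‘correspondences’ are simply the facts the two sketches share, and the advice-generating differences are the solution facts the student sketch lacks. The facts and entity names are invented for the example:

```python
# Each sketch is represented as a set of (relation, arg1, arg2) facts.
solution = {("inside", "InnerCore", "OuterCore"),
            ("inside", "OuterCore", "Mantle"),
            ("inside", "Mantle", "Crust")}

student  = {("inside", "InnerCore", "OuterCore"),
            ("inside", "Mantle", "OuterCore")}     # student swapped two layers

# (1) correspondences: facts that appear in both sketches
correspondences = solution & student

# (2) candidate inferences: solution facts with no match in the student sketch,
#     which is where advice to the student would come from
candidate_inferences = solution - student

print("correspondences:", correspondences)
print("differences to advise on:", candidate_inferences)
```
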
greenhousesketches
Fig. 5. Greenhouses. These three sketches are all valid solutions to the Greenhouse effect worksheet. CogSketch views them as equivalent because they all satisfy the properties the author specified as important.

Look how different the pictures are, and yet CogSketch sees that they are the same on a deeper level, paying attention to what the teacher marked as important when drawing the prototype sketch.

In the future, the team at Northwestern University wants to better handle images with entities such as force fields and air masses. An example would be the change in flow orientation as air moves around an airplane wing.

They also want to add 3D representation and reasoning capabilities to CogSketch, to better interpret sketches of surfaces and complex shapes, which often occur in engineering, biology, and geoscience. Since sketches are 2D, much can currently be done by crafting advice based on 2D visual properties, but 3D representations could let students choose a wider variety of viewpoints when sketching.

Sources:
For more info, including papers on analogical reasoning in general, take a look at: http://www.qrg.northwestern.edu/papers/papers.html

Numenta and Spaun – A hobbyist’s guide to reverse-engineering the brain.

Neural net models that are loosely inspired by animal nervous systems have been around for many years.   They are made up of many nodes (neurons) that do a computation of some kind on the sum of the signals coming in to them.   The incoming signals travel along connections that are weighted, so some signals are magnified, some are reduced.   Signals on a negatively weighted connection inhibit the neuron they come into.

Weights can be thought of as an encoding.  For instance, let’s say one neuron represents yellow and another blue, and they both feed to a third neuron that weights them equally.  The resulting neuron might then represent “green”.  

If the weights on all the connections of the net are tweaked, you can get the net as a whole to recognize patterns.   For example, it might output a ‘1’ every time you present its input layer with a picture of a zebra.

Nets can learn by supervision, where they are given a set of patterns with known answers and you can correct for the discrepancy between their initial predictions and the real answer by changing all their weights.  They can also learn by unsupervised self-organization where they might simply be given images and learn characteristics of images.

In recent years there have been two interesting theories which come with software that anyone can download from the internet and experiment with. One theory is the Neural Engineering Framework (NEF) from Chris Eliasmith’s team at the University of Waterloo in Canada. They supply you with a simulation environment called Nengo – you give it the function you want it to compute, and it organizes groups of neurons to compute it in a biologically realistic way. Or you can devise neural circuits and see what they can do.

eliasmith
Chris Eliasmith

A second theory, called Hierarchical Temporal Memory (HTM), comes from Jeff Hawkins’s company, Numenta.  So far they model just one layer of the cortex, on the theory that much of the brain works on some basic principles, and those principles can be seen working even within one simple layer. Their software is also free to download, and is called “Nupic”.


Both NEF and HTM theory assume a high dimensional space with sparse signals.  (A low dimensional space is easy for humans to visualize, but above 3 dimensions it becomes impossible.) To review some basics: any cube has 3 dimensions with an x-axis, a y-axis and a z-axis.   If there are only a few data points in the cube, the data is sparse.  A vector can be looked at as an arrow that starts at the origin (the point 0,0,0) and reaches out to any point in the cube.   A vector with more than 3 numbers has more than 3 dimensions.   Very high dimensional spaces have counter-intuitive properties.   For instance, evolution in biology has puzzles that only make sense when you look at gene mutations as exploring a high dimensional space. Numenta uses “Sparse Distributed Representations” (SDRs), which are basically high dimensional vectors of 1’s and 0’s (you can think of 1 as ‘on’ and 0 as ‘off’), where perhaps only 2 percent of the bits are 1.
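
As a rough illustration of what an SDR looks like (the sizes here are my own choice, not Numenta’s exact parameters), here is a Python sketch; SDRs are typically compared by counting how many ‘on’ bits they share:

```python
import numpy as np

# A minimal sketch of a Sparse Distributed Representation: a long binary
# vector with only ~2 percent of the bits on.
n_bits, n_on = 2048, 40          # 40/2048 is roughly 2 percent
rng = np.random.default_rng(0)

def random_sdr():
    sdr = np.zeros(n_bits, dtype=int)
    sdr[rng.choice(n_bits, size=n_on, replace=False)] = 1
    return sdr

a, b = random_sdr(), random_sdr()
# SDRs are compared by their overlap -- how many 'on' bits they share.
print("overlap of two random SDRs:", int(np.dot(a, b)))   # almost always near zero
```

Two randomly chosen SDRs almost never share many ‘on’ bits, which is what makes accidental collisions between representations so unlikely.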

Initially Numenta used their model to predict sequences, but recently it occurred to them that when you explore an object, say by touch (with your eyes closed), you may only handle small parts of the object at a time, yet you get the general idea of its shape from the sequence of touches.  With your eyes open you still explore objects sequentially, this time by vision – you just don’t realize it.  Your eyes jump rapidly from place to place in a scene, assembling a model of it.   So Numenta has been making exciting headway on that front, and for a clear explanation of what they are attempting, see the IEEE article at:

http://spectrum.ieee.org/computing/software/what-intelligent-machines-need-to-learn-from-the-neocortex

hawkins2


Numenta’s learning algorithm is also interesting in that it doesn’t have connections that start out weak and then get stronger.  Instead, they treat synapses as either/or: either two neurons are linked by a synapse that has grown strongly enough that they communicate, or they do not communicate at all.   This is based on the observation that synapses grow and fade in the brain.   So in the model, a potential connection might start out with a weak “permanence” value, and if that value increases above a certain threshold, then you have a connection.
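
A minimal sketch of that idea in Python (the threshold and increment are made-up values, not Numenta’s actual parameters):

```python
# Each potential synapse carries a scalar "permanence"; only when permanence
# crosses a threshold does the synapse count as connected at all.
CONNECTED_THRESHOLD = 0.5
LEARNING_INCREMENT = 0.1

permanence = 0.3                      # starts below threshold: not connected
for step in range(4):
    permanence = min(1.0, permanence + LEARNING_INCREMENT)   # reinforced by activity
    connected = permanence >= CONNECTED_THRESHOLD
    print(f"step {step}: permanence={permanence:.1f}, connected={connected}")
```
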
HTM theory has two self-limiting influences: it forces a maximum on the number of cortical columns that can be active at any time, thus keeping the representation sparse, and it boosts any column that has had very little activity over a time period, while inhibiting any column that is active too often.  By this balancing act, it increases the likelihood that every column in the cortex is involved in at least one representation.
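
Here is a toy sketch, loosely in the spirit of that balancing act; the boosting rule and all the numbers are my own simplifications, not Numenta’s actual algorithm:

```python
import numpy as np

# Only the top-k most active columns stay on (keeping the representation sparse),
# and columns that have rarely won recently get their scores boosted.
rng = np.random.default_rng(3)
n_columns, k = 20, 4
duty_cycle = np.zeros(n_columns)          # how often each column has recently won

for step in range(5):
    scores = rng.random(n_columns)
    boosted = scores * (1.0 + (0.1 - np.minimum(duty_cycle, 0.1)))  # boost quiet columns
    winners = np.argsort(boosted)[-k:]    # enforce sparsity: only k columns active
    active = np.zeros(n_columns)
    active[winners] = 1
    duty_cycle = 0.9 * duty_cycle + 0.1 * active
    print(f"step {step}: active columns {sorted(winners.tolist())}")
```
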
At this point I will only give some brief highlights, because both theories are clearly explained (usually without any difficult math) by papers and videos from both groups (I give links to their explanations below).
Eliasmith’s group has used its theory to build Spaun, the world’s largest functional brain model, using 2.5 million neurons to perform eight different cognitive tasks. For example, you can give it a picture of a number and it will recognize the number and write out the number in script via a simulated 6-muscle arm. (A six-million-neuron version will soon be available for download.) Spaun is just one of many models built with Nengo.

Two interesting insights they had were these. 

1. A number can be represented not just by one neuron, but by the average activity of a group of neurons.  Even though each neuron either spikes or does not, outputting a ‘1’ or a ‘0’, the average of these 1’s and 0’s can be any fractional number. A scalar number is one dimensional, but you can represent two (or higher) dimensional quantities as well.  For instance, you can have a group of neurons where each one represents a direction (north, south, west, east, northwest, etc.), and each one fires most strongly in its preferred direction, somewhat less strongly in a nearby direction, and not at all in the opposite direction.  If you have enough neurons to cover many directions, the activity of the group can represent an incoming signal that has both a direction and a magnitude.   (The formal term for a neuron’s response curve – for instance, that of a neuron that responds best to the color red – is its ‘tuning curve’. In this example each neuron is tuned to a different direction.)

The output of the group can recreate the input. To do this you have to sum the outputs of all the neurons, with an optimal weighting of each neuron’s output.  For instance, if input ‘x’ feeds neural group A, weights coming out of A can recreate input ‘x’.   Then you can weight ‘x’ again to produce some other function, such as the square of the input, or you can combine it with outputs from other neural groups and feed them into a third neural group.   In the illustration below, the picture on the left shows that the input x is recreated from neural group A by using weights to combine the outputs of the neurons in it, and then x is used again to create neural group B.  The illustration on the right shows that the intermediate step (of recreating x) is not necessary: with the proper weights you do the equivalent of recreating ‘x’ and then creating B from ‘x’.    We can think of the connections between the neurons of A and one element of ‘x’ as a vector (since it’s a collection of numbers), and we can extend that to think of the connections from A to all the elements of ‘x’ as a matrix (a set of vectors stacked in rows).   In fact, much of what NEF does is linear algebra.
intoout
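
Here is a small Python sketch of that encode/decode idea, using a simplified rectified-linear tuning curve rather than Nengo’s actual neuron models; the decoding weights are found with ordinary least squares, which is the ‘linear algebra’ mentioned above:

```python
import numpy as np

# A scalar is represented by the combined activity of a group of tuned neurons,
# and weights on their outputs recreate (decode) the original value.
rng = np.random.default_rng(1)
n_neurons = 50
encoders = rng.choice([-1.0, 1.0], n_neurons)     # preferred direction (+ or -)
gains = rng.uniform(0.5, 2.0, n_neurons)
biases = rng.uniform(-1.0, 1.0, n_neurons)

def rates(x):
    """Firing rates of the whole group for input value x (rectified-linear tuning)."""
    return np.maximum(0.0, gains * encoders * x + biases)

# Solve for decoding weights d so that rates(x) @ d is approximately x over a range of x.
xs = np.linspace(-1, 1, 100)
A = np.array([rates(x) for x in xs])               # activity matrix
d = np.linalg.lstsq(A, xs, rcond=None)[0]          # least-squares decoders

print("decoded 0.3:", rates(0.3) @ d)              # should be close to 0.3
```
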

2. One weakness of neural nets is that they don’t represent composite objects.   A chair, for example, is a composite object with legs, a seat, and a back.   However, Eliasmith’s group has looked at existing papers on the topic of “Vector Symbolic Architectures”, which explain how to make a vector in high dimensional space represent such an object.  Suppose you had an arbitrary vector that represents the word ‘DOG’, another that represents ‘CAT’, a vector that represents the part of speech ‘VERB’, and a few other parts of speech.   You could combine all these vectors by concatenation to represent the sentence “Dogs chase Cats”, but eventually your resulting vectors would get very large and very high dimensional.  There are other ways to combine the parts of speech and the words so that your vectors do not get larger.  You get a new vector, which does lose some information, but you can decompose it into its parts despite the loss.  Because of the loss of information, there will be noise, but if you have a memory of what the vectors for “Dog” and “Cat” look like, you can clean up the noise and get your components back.  Furthermore, the vectors maintain similarity, so that if “pink” and “red” have similar vectors, then “pink square” and “red square” will also have similar vectors.
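
Here is a rough Python sketch of that binding-and-cleanup idea, using circular convolution (computed with FFTs) and random vectors; the role names and dimensionality are my own choices for illustration:

```python
import numpy as np

# Circular convolution binds two vectors into one of the same size; convolving
# with an approximate inverse gets a noisy version of the original back, which a
# cleanup memory snaps to the nearest known vector.
rng = np.random.default_rng(2)
D = 512                                         # dimensionality (my choice)

def vec():        return rng.normal(0, 1/np.sqrt(D), D)
def bind(a, b):   return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))
def inverse(a):   return np.concatenate(([a[0]], a[:0:-1]))   # approximate inverse

DOG, CAT, CHASE, AGENT, VERB, OBJECT = (vec() for _ in range(6))
sentence = bind(DOG, AGENT) + bind(CHASE, VERB) + bind(CAT, OBJECT)

# "Who is the agent?" -- unbind, then clean up against known vectors
noisy = bind(sentence, inverse(AGENT))
vocabulary = {"DOG": DOG, "CAT": CAT, "CHASE": CHASE}
best = max(vocabulary, key=lambda w: np.dot(noisy, vocabulary[w]))
print("agent of the sentence:", best)           # almost certainly 'DOG'
```

Unbinding returns a noisy vector, and the ‘cleanup’ step is just picking the known vocabulary vector it most resembles.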

“Vector Symbolic Architectures” allow for inductive reasoning as follows:  You may remember puzzles where you see a series of pictures and try to guess what the next picture should be.   It turns out that if these pictures are represented as vectors of symbol combinations, you can mathematically find the transform that leads from one picture to the next, and then when you get to the final missing picture, just average all those transforms (which themselves are vectors or matrices) and you will get a transform that will likely predict the correct missing picture.

You can read more on this at:

http://compneuro.uwaterloo.ca/files/publications/stewart.2012d.pdf
The University of  Waterloo group also offers a course, and the notes are at:
http://compneuro.uwaterloo.ca/research/syde-750.html
You can get the program for free from http://nengo.ca

Numenta has a great explanation of their cortex simulation at:
https://numenta.com/assets/pdf/biological-and-machine-intelligence/BAMI-Complete.pdf

If you prefer videos, they have a series that explains the theory well at:
https://numenta.org/htm-school/

It would be interesting to see if these two theories could be combined in some way.  In “Vector Symbolic Architectures”, you can represent the sentence “Birds fly” as “Birds * Position#1 * Noun + Fly * Position#2 * Verb” (the asterisk is an operator called circular convolution).  The sentence can be decomposed into vectors that stand for its elements (such as ‘bird’).  An object such as a chair could be decomposed into its 4 legs, its back, and its seat.  So we could look at SDRs in HTM theory and see if they too can be decomposed into elements.
HTM theory can create a 3D model of an object when a cortex model is combined with a motor signal and a sensory signal. The motor signal says what to explore at any point (like a motor signal to move the eye to focus on a particular point on a sculpture that you are looking at), and the sensory signal says what the exploration is finding (the resulting signal from the new position where the eye is focusing). In this case a 3D model of the sculpture (in object-centered coordinates) will be built by the cortex. Hawkins says the representation is not like an image, but more like a 3D CAD model (CAD is computer-aided design).  So I asked him (in a forum) if a hierarchy of elements could also be represented, as they can in “Vector Symbolic Architectures.” This is what he wrote:

Our new work on sensory-motor inference does move closer to your “composite” objects goal. Basically, in our model, objects are defined as a set of features at different locations on the object. The “features” are just SDRs and could in theory represent anything, such as another object.
So far we have been modeling one or more columns in a single region, that is, no hierarchy. In these models the only “features” that can be associated with a location are pure sensory features. I think we would need to have at least two levels in the hierarchy to achieve a composite object as you envision them to be. But the mechanism supports compositional objects.

So there are likely areas for additional discoveries that will come out of Numenta’s model. As for Eliasmith’s model, it’s safe to say that more discoveries will come out of it too. For instance, one could attempt to study neural disorders (and possibly mental disorders) by altering the behavior of their neurons. They model a part of the brain (the basal ganglia) that, when it goes wrong, can lead to motivation disorders such as addiction, and when it works correctly can decide what to pay attention to – so (my guess is) this could eventually lead to a computer modeling a “train of thought”.

Both projects are worth exploring, and what’s great about the internet is that the software is free and you can participate.  They have forums which they monitor; for instance, if you are interested specifically in the sensory-motor advance of Numenta, you can ask questions at https://discourse.numenta.org, and if you are interested in the Neural Engineering Framework, you can ask questions at https://forum.nengo.ai/.

That’s all for this post, but for people interested in side-details: this is how both models handle words and language as inputs:
Both models have to be attached to a source of inputs, that end up as a series of numbers (or vectors). In fact, it could be argued our brain receives ALL information about the world as a series of numbers – via our vision, hearing, and so forth.
Vectors for words do not have to be arbitrary strings of on/off bits. One way to make them less arbitrary is to use a method called “Latent Semantic Analysis” that analyzes many documents, and finds that certain words occur together in certain types of documents, and so creates vectors for each word that preserve relationships.  For instance, the vector for “Alzheimer’s” would be similar to the vector for “Plaque” because the two words tend to occur in the same documents.  Eliasmith and Peter Blouw extended that idea to include some context information in the vector (such as position in the sentences where these words are encountered).
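
As a toy illustration of that co-occurrence idea (the documents and words below are invented, and real Latent Semantic Analysis also applies a singular value decomposition, which is omitted here):

```python
import numpy as np

# Each word's vector is just its counts across a few made-up documents, so words
# that appear in the same documents end up with similar vectors.
documents = ["alzheimers plaque brain memory",
             "plaque alzheimers amyloid brain",
             "engine fuel piston engine"]
vocab = sorted({w for doc in documents for w in doc.split()})
counts = np.array([[doc.split().count(w) for doc in documents] for w in vocab])

def similarity(w1, w2):
    a, b = counts[vocab.index(w1)], counts[vocab.index(w2)]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity("alzheimers", "plaque"))   # high: they occur in the same documents
print(similarity("alzheimers", "piston"))   # zero: they never co-occur
```
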
In the human brain, it was found with MRI that concepts really are organized semantically (and often physically).  Here is one semantic map (you can see gallantlab.org/semanticmovies for more details):

semanticchart

Numenta’s software also needs inputs that are organized by meaning in some way. Even if you give it a series of numbers, you should encode those numbers as a series of bits so that the number ‘5’ is closer to ‘6’ than it is to ‘8’. For language inputs, they use a company (cortical.io) that makes a “semantic map” by looking at documents. Each word actually has coordinates in the map, and that means it can be represented by numbers. And they can handle sentences too: a vector for a sentence just adds up the vectors for those words. The summed vector doesn’t get too clogged with ‘on’ bits, because they force a limit arbitrarily on the number of bits that can be on. This can be done because often only a few bits are enough to recognize (or at least be reasonably confident of) the entire representation. So getting rid of bits is not as problematic as it might sound. A much better explanation than I can give is at: https://www.youtube.com/watch?v=HLuRQKzYbb8. And there are some interesting applications of their software mentioned on their website.
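
To make the earlier point about encoding numbers concrete, here is a toy Python sketch – a simple bucket encoder in the spirit of (but much simpler than) Numenta’s scalar encoders; all the sizes are my own choices:

```python
import numpy as np

# Encode a value as a contiguous block of 'on' bits, so that nearby values
# share more bits than distant values.
n_bits, n_active, max_value = 100, 10, 50

def encode(value):
    sdr = np.zeros(n_bits, dtype=int)
    start = int(round(value / max_value * (n_bits - n_active)))
    sdr[start:start + n_active] = 1
    return sdr

def overlap(a, b): return int(np.dot(a, b))

print("overlap(5, 6):", overlap(encode(5), encode(6)))   # large overlap
print("overlap(5, 8):", overlap(encode(5), encode(8)))   # smaller overlap
```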

Does feeling come before reason?

Suppose you are reading an article, and suddenly you hit a sentence that bothers you.  For instance, let’s suppose you are a psychologist who reads a finding that genetics affects behavior, and since you are very anti-eugenics, a feeling of dismay arises in you.  (This could happen if you felt that the implication of this finding is that different ethnic groups might have behavioral differences, and further, that some groups might have worse behavior than others.)

But the feeling comes fast, and you then feel the need to explain why the fact is wrong.  Without the feeling that started the process, you would have no desire to delve deeper, or to find reasons why the assertion is wrong.   Sometimes this kind of response is dismissed as rationalization of emotional bias, but I don’t think that is always the case.

I was recently reading about semantic-pointer theory, which says that a concept can be represented at several levels, each level more abstract than the one before.   If true, this might mean that you feel a “mismatch” at a high, abstract level, which could be indeed real, and then you search for the explanation of why you feel so uneasy with an assertion you just encountered.   The mismatch that you sensed could be with a deeply held ethical belief, but it could also be a different type of mismatch, a mismatch with knowledge that you have.

In 1957, Leon Festinger proposed that human beings strive for internal psychological consistency in order to mentally function in the real world. A person who experiences internal inconsistency tends to become psychologically uncomfortable, and so is motivated to reduce the cognitive dissonance: either by changing parts of the cognition to justify the stressful behavior, by adding new parts to the cognition that causes the dissonance, or by actively avoiding social situations and contradictory information that are likely to increase the magnitude of the dissonance.

But reducing some kinds of inconsistency is actually desirable.  If you see a contradiction between two pieces of information, obviously some thought on your part (or an attempt to find new information), is necessary to get at the truth.

Festinger showed that reducing cognitive dissonance can lead to crazy results.  For instance, there was a religious cult that believed an alien spacecraft would soon land on Earth to rescue them from earthly corruption. At the appointed place and time, the cult assembled; they believed that only they would survive planetary destruction; yet the spaceship did not arrive.

This obviously led to discomfort.

Had they been victims of a hoax? Had they vainly donated away their material possessions?

Rather than believe that, most of the cult ended up accepting the idea that due to their own efforts to spread “light” they had saved the earth.  The aliens from outer space had given planet Earth a second chance at existence.

The logic here may seem crazy, but if you have a fixed point that you do not doubt (in their case it was the destruction of the earth),  then if you encounter data that contradicts that fixed point, it makes sense to find an explanation for the data that leaves the fixed point intact.   In a more sensible scenario, your fixed point might be that light cannot travel faster than a certain speed, and if you encounter data that contradicts that fixed point (such as an experiment that was reported in major newspapers a few years ago), chances are you will doubt it, and double check.  As well you should.

Let’s suppose you are a little company that makes a living by ‘fracking’, and you see a report that earthquakes have increased in areas where there is a lot of fracking.   Your self-interest creates an emotion of dismay, and you will look for evidence that contradicts that observation.    You find out that it is true, but that the main reason for it is the method of disposal of wastewater (injecting it deep into the ground), and that another method of disposal can be found.  So here, the emotion you felt has caused a good result.

If you see yourself as a hero, and you display cowardice in a situation, you may feel dissonance, and that may inspire soul-searching.   If you see yourself as a stickler for truth, and find you have fallen for a scam, that too might prompt self-examination.  If you see yourself as decent, but large numbers of your Facebook friends are sharing a regrettable photo from your wild youth, that too might inspire thought.

The big problem with this type of thinking is in cases where your fixed point (whatever it is) should give way to the unpleasant information you have just been exposed to, but your emotions prevent it from giving way.
One suggestion I want to make in this post is that detecting inconsistencies and trying to resolve them is a good thing.   If the contradiction causes you discomfort and you try to remove that discomfort, you are not necessarily “rationalizing” away the facts.
The other point is that ‘feelings’ may be an unconscious precursor to understanding.  Albert Einstein said something that illustrates this: “But the years of anxious searching in the dark for a truth that one feels and cannot express, the intense desire and the alternations of confidence and misgiving until one achieves clarity and understanding, can be understood only by those who have experienced them.”