Towards Symbolic Processing by Neural Networks: A Coordination Language Game.
We investigate the response of two coupled Izhikevich (2007) type spiking neural networks to pair training in a simple coordination task. The trial protocol as first designed is shown below. The environmental signal A or B is delivered to approximately half the neurons in the region from 0 to 15 on the sender neural network. This signal is a 5mV depolarization of excitatory thalamic input neurons lasting 98ms. The regions representing the output "1" or "0" have their spikes counted between 90ms and 190ms into the trial. At 190ms, if the "1" region has produced more spikes than the "0" region, "1" is uttered, and vice versa. The utterance activates the receiver's sensory areas in the same way as the sender's A/B areas were activated earlier, lasting from 190ms to 290ms. The answer is then detected in the receiver's answer areas by counting spikes for "A" and "B". At 680ms, if A > B, "A" is uttered. Reward is calculated and given with a characteristic delay of 10ms after 700ms. A new trial is started after 2000ms.
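The sender's decision rule described above can be sketched as follows. This is only a minimal illustration, not the simulation code: the function names, the half-open counting window, the tie-breaking toward "0", and the spike times are all assumptions made for the example.

```python
# Counting window for the sender's output regions, taken from the protocol text.
SENDER_WINDOW_MS = (90, 190)

def count_spikes(spike_times_ms, start_ms, end_ms):
    """Count spikes with start_ms <= t < end_ms (half-open window is an assumption)."""
    return sum(1 for t in spike_times_ms if start_ms <= t < end_ms)

def utter(spikes_one, spikes_zero):
    """Return 1 if the "1" region fired more than the "0" region, else 0."""
    # Tie-breaking toward 0 is an assumption; the protocol does not specify it.
    return 1 if spikes_one > spikes_zero else 0

# Hypothetical spike times (ms) pooled over each output region.
one_region = [12, 95, 120, 150, 189, 230]
zero_region = [40, 100, 200]

n_one = count_spikes(one_region, *SENDER_WINDOW_MS)
n_zero = count_spikes(zero_region, *SENDER_WINDOW_MS)
utterance = utter(n_one, n_zero)
```

The same two helpers would apply to the receiver's "A"/"B" answer areas with a different window.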

1. (OBSOLETE) 1st February 2008. Language coordination game. 25 x 25 neurons. STDP + reward.
2. 2nd February 2008. Two runs with 30x30 neurons. Same protocol as above. Download run1, run2. Data of all trials, run1 run2.
The two films above give the following information.
The sender neural network is at the bottom, and the receiver neural network is above it. From left to right, the first 4 blue diamond perspective views show the components of weight change that are due to different conditions in the experiment (LTP in Env A, LTD in Env A, LTP in Env B, and LTD in Env B). The next two show the weights and the eligibility traces of these weights. The cyan rectangle above shows a record of the types of trial experienced by the pair of agents. The coding is {Env type (0 = A, 1 = B), Utterance (0 or 1), Answer (0 = A, 1 = B)}. We see in the above example run that 011 and 111 are the most common types of trial experienced, i.e. where environment = 0, the sender says 1 and the receiver says 1, and similarly when the environment is 1. This is very poor behaviour with no learning. The response of the receiver is always to say 1, and similarly with the sender.
Note also that LTD is ubiquitous over all weights, whereas significant amounts of LTP are confined to a subset of weights. NOTE however that the distribution of LTP does not differ much between environment A and environment B, as we might expect.
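The three-digit trial coding can be tallied with a few lines of Python. This is a sketch under assumptions: the trials are taken to be available as (env, utterance, answer) tuples, and a trial is counted as correct when the answer matches the environment. The function name and example data are hypothetical.

```python
from collections import Counter

def tally_trials(trials):
    """Return (counts per 3-digit trial code, fraction of trials with answer == env)."""
    trials = list(trials)
    codes = Counter("".join(str(v) for v in trial) for trial in trials)
    # Assumption: the pair is "correct" when the receiver's answer equals the environment.
    n_correct = sum(1 for env, _, answer in trials if env == answer)
    return codes, n_correct / len(trials)

# Hypothetical record resembling the failure mode described above:
# both players always say 1, whatever the environment.
example = [(0, 1, 1), (1, 1, 1), (0, 1, 1), (1, 1, 1)]
codes, accuracy = tally_trials(example)
```

With this example record, only the codes 011 and 111 occur and the accuracy sits at chance (0.5), which is what "always say 1" would produce when A and B are equally frequent.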
3. The system is not learning the task. Several modifications are possible. The protocol differs from Izhikevich (2007) in several respects.
i. Confirm that the background firing rate is < 1 Hz. If not, reduce basal thalamic input.
ii. Iz2007 used randomly connected neurons with 10% connectivity. We used spatial connectivity.
iii. Iz2007 used a maximum axonal conduction delay of 1ms; we use a distance-based delay.
iv. Iz2007 capped excitatory weights between 0 and 4mV; we have no capping. Check the weight distribution achieved: Iz2007 showed most weights were < 0.1mV at steady state. Check whether reinforced weights are reaching the 4mV cap.
v. Iz2007 (Stimulus-Response Instrumental Conditioning) only measured the output regions' firing activity up to 20ms after the stimulus. In our experiment we wait MUCH longer to measure this activity, e.g. 500ms-1000ms. The reverberatory activity of the network may well have ceased by then. Change this to measure activity in the first 20ms immediately following the stimulus.
vi. Iz2007 separated each trial by 10 seconds so that the activity and eligibility traces could settle down between distinct trials. We only separate trials by 1 second. Try a longer separation.
vii. In Iz2007, reward was delivered up to 1 second after response calculation. The delay was inversely proportional to the ratio [A]/[B] if [A] was the desired response.
viii. Iz2007 used only 50 excitatory neurons to input the stimuli. We use 15x15x0.5 neurons for each stimulus. Our output neuron set is also much larger, i.e. 15x15x0.5, whereas Iz2007 also chose 50 neurons for the output set.
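The checks suggested in item iv. could look roughly like this. It is a sketch only: the 0 to 4mV bounds and the 0.1mV threshold come from the notes above, while the function names and the example weights are hypothetical.

```python
def clip_weights(weights, w_min=0.0, w_max=4.0):
    """Cap excitatory weights (mV) to the range [w_min, w_max]."""
    return [min(max(w, w_min), w_max) for w in weights]

def weight_summary(weights, small=0.1, w_max=4.0):
    """Fraction of weights below `small` and fraction sitting at (or above) the cap."""
    n = len(weights)
    return {
        "frac_small": sum(1 for w in weights if w < small) / n,
        "frac_at_cap": sum(1 for w in weights if w >= w_max) / n,
    }

raw = [-1, 0.05, 2, 7]            # hypothetical uncapped weights (mV)
capped = clip_weights(raw)        # [0.0, 0.05, 2, 4.0]
summary = weight_summary(capped)
```

Applying `weight_summary` to the steady-state weights of a run would show directly whether most weights are below 0.1mV and whether the reinforced ones are piling up at the cap.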
4. First modification. The trial is altered to make the experiment closer to v., vi. and vii. above. See below.
The stimulus is given for only 1ms, and recording in the 1/0 and A/B areas is only done for 20ms. If the answer was correct, reward is given with a characteristic time of [B]/[A] (where A was the correct answer). Trials now last 10000ms.
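One possible reading of the reward timing is sketched below. The notes only say that the characteristic time goes as [B]/[A] when A is correct; the scale factor, the 1000ms cap (borrowed from item vii.), and the function name are assumptions, not the values used in the runs.

```python
def reward_delay_ms(count_correct, count_wrong, scale_ms=1000.0, max_delay_ms=1000.0):
    """Delay before reward, proportional to [wrong]/[correct] spike counts.

    Returns None when the answer was wrong (no reward is given).
    scale_ms and max_delay_ms are illustrative assumptions.
    """
    if count_correct <= count_wrong:
        return None
    return min(max_delay_ms, scale_ms * count_wrong / count_correct)

delay_clear = reward_delay_ms(10, 0)   # unambiguous answer: immediate reward
delay_close = reward_delay_ms(10, 5)   # weaker preference: reward arrives later
no_reward = reward_delay_ms(3, 7)      # wrong answer: None
```

Under this reading, a more decisive correct response is rewarded sooner, so its eligibility traces have decayed less by the time reward arrives.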

Experiment 1.
Experiment 2.
Download movie and trials data. We can analyse the correlations between input, utterance and answer over the course of the trial, i.e. look at the behaviour of each player.
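A simple way to look at each player's behaviour separately is to build one contingency table per player. The sketch below assumes the trials data can be read as (env, utterance, answer) tuples; the function names and example data are hypothetical.

```python
from collections import Counter

def player_tables(trials):
    """Return (sender table, receiver table) as Counters keyed by (input, output)."""
    trials = list(trials)
    sender = Counter((env, utt) for env, utt, _ in trials)
    receiver = Counter((utt, ans) for _, utt, ans in trials)
    return sender, receiver

def conditional(table, given, outcome):
    """Estimate P(output == outcome | input == given); None if the input never occurred."""
    total = sum(count for (g, _), count in table.items() if g == given)
    return table[(given, outcome)] / total if total else None

example = [(0, 1, 1), (1, 1, 1), (0, 1, 1), (1, 0, 1)]
sender, receiver = player_tables(example)
p_sender = conditional(sender, 1, 1)      # P(utterance = 1 | env = B)
p_receiver = conditional(receiver, 0, 0)  # P(answer = A | utterance = 0)
```

A sender that has settled on a unique labelling would show conditionals near 0 or 1 with opposite utterances for the two environments; a receiver that has decoded it would show the matching pattern in its own table.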
5. Modification 2. We reduce the size of the input and output sets of neurons to be consistent with viii. above.
Download movie and trials data.
6. Modification 3. Probability of connections increased from 5% to 10% (but spatial connectivity remains).
Download trials data. (1), (2).
7. Modification 4. Random connectivity at 10% + 1ms delay for all neurons (as in Iz2007).
Download movie for 6 & 7 and trials data.
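For reference, random connectivity of the kind used in Modification 4 can be generated as below. This is a sketch, not the code used for the runs: it assumes independent Bernoulli sampling of each ordered pair and excludes self-connections, neither of which is stated in the notes, and the function name is hypothetical.

```python
import random

def random_connectivity(n_neurons, p_connect=0.10, delay_ms=1, seed=0):
    """Return a list of (pre, post, delay_ms) synapses with uniform random wiring."""
    rng = random.Random(seed)
    synapses = []
    for pre in range(n_neurons):
        for post in range(n_neurons):
            # Assumption: no self-connections; each ordered pair sampled independently.
            if pre != post and rng.random() < p_connect:
                synapses.append((pre, post, delay_ms))
    return synapses

synapses = random_connectivity(100)
density = len(synapses) / (100 * 99)   # close to 0.10 on average
```

Replacing the `rng.random() < p_connect` test with a distance-dependent probability, and `delay_ms` with a distance-dependent delay, would give the spatial variant used in the earlier runs.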
Possible reasons for failure. A task decomposition solution
The training requires adaptation of two separate functions. First, the sender must decide to encode A as either 0 or 1, and vice versa for B. Second, the receiver must decide to produce output A or B for input 0 or 1. Which of these decisions is correct for the receiver depends on the decision made by the sender.
If two humans were to do this task, the sender would naturally decide to UNIQUELY label environments with 0 or 1 arbitrarily, without having to receive any reward from the receiver. The receiver would also ASSUME that a unique labelling had been produced by the sender and would try flipping codings (by systematic search) until reward was obtained. Thus, there is a huge amount of task-independent knowledge being applied to the solution of the task that our networks certainly do not have.
We might consider a simpler experiment in which we first only reward the sender for uniquely classifying A to 0 and B to 1, or reward ANY unique classification of inputs by the sender. It is most natural to use a self-organizing map to assign unique labels to distinct classes of stimuli. It would be interesting if this process "self-organized" without having to use task specific reward obtained by downstream circuits. This would simplify the task of the top level reward process.
Implementing self-organizing maps using spiking neural networks with delay.
If independent stimuli (e.g. A and B above) are injected at different times into an Iz2007 type layer, how can it self-organize so as to produce outputs 0 and 1 that "describe" the class into which the input vector falls? We can examine this self-organizing process by examining the receptive fields of neurons when inputs A and B are injected randomly over a long period of time. For each neuron we define a specificity value: Neuronal Specificity = |A|/(|A| + |B|). Depending on the extent of lateral inhibition, I would expect a high specificity to self-organize in a subset of neurons (independent of explicit reward). IF some GATING mechanism existed to select the output set based on those neurons that had the highest specificity, then this would immediately make the coordination task (and many other such tasks) easier, I assume.
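The specificity measure and a crude version of the gating idea can be sketched as follows. The interpretation of |A| and |B| as a neuron's spike counts in response to stimulus A and stimulus B is an assumption, as is ranking neurons by their distance from 0.5 so that both A-specific and B-specific neurons count as "highly specific". All names and data are hypothetical.

```python
def specificity(spikes_a, spikes_b):
    """|A| / (|A| + |B|); None for a neuron that never fired to either stimulus."""
    total = spikes_a + spikes_b
    return spikes_a / total if total > 0 else None

def select_most_specific(counts, k):
    """counts: {neuron_id: (spikes_a, spikes_b)}. Return the k ids farthest from 0.5."""
    scored = []
    for neuron_id, (a, b) in counts.items():
        s = specificity(a, b)
        if s is not None:
            # 0.5 means no preference; 0 or 1 means fully B- or A-specific.
            scored.append((abs(s - 0.5), neuron_id))
    scored.sort(reverse=True)
    return [neuron_id for _, neuron_id in scored[:k]]

counts = {0: (9, 1), 1: (5, 5), 2: (0, 8), 3: (0, 0)}   # hypothetical spike counts
gated = select_most_specific(counts, 2)
```

In this example neuron 2 (purely B-specific) and neuron 0 (mostly A-specific) are selected, while the unselective neuron 1 and the silent neuron 3 are left out of the output set.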
9. Experiment. Give A and B stimulus to sender. Measure spontaneous specificity distribution with no (or low) reward. Is there a tendency for lateral inhibition to produce unique encodings of independent stimuli? This could structure the way that the system responds to reward that selects for a particular encoding, e.g. make selection much faster, allowing it to FLIP rapidly between possible encodings.