Notes on Neural Evolution & Language Learning
RL differs from AE (Artificial Evolution) because the behavioural policy is modified online during the task based on information about rewards obtained during the task.
Reward Function
The aim of RL is to maximize some reward function, e.g. reward obtained over a finite horizon or a receeding horizon or average reward over h time steps. The choice of optimality model matters a great deal, and human skill is often required to choose it for a particular application. Once the optimality criterion AND the function that determines reward r = f(E), which is a function of environmental states, has been determined, it can be used to assess policies for behaviour.
k-armed bandit
Can be solved with 1. Dynamic Programming to do Bayesian reasoning. 2. Learning automata. 3. Boltzman exploration of action space on the basis of expected reward.
Delayed Reinforcement Problems
Markov Decision Processes MDPs. The value V of a state is the expected infintite discounted sum of reward that the agent will gain it if srarts in that state and executes the optimal policy. The optimal policy maximizes the sum of V over all states. The Optimal Value function is that which maximizes value over all states.
One can use value iteration to go through all state action pairs, calculate the value of a pair as the value obtained for the action that returned the greatest reward + the greatest summed value due to all transitions from that state.
Policy iteration starts with a policy and improves it to maximize value obtained.
The above methods assume a model i.e. we know the state transition function and the reinforcement function. However, if we don't the the agent must interact with the environment to produce the optimal policy. In this case one may 1. Learn a controller without learning a model. 2. Learn a model and use it to produce a controller.
The value of a state can be estimated by the immediate reward obtained and the estimated value of the next state. These are TEMPORAL DIFFERENCE METHODS (Sutton, 1988).
Adaptive Heuristic Critic and TD (LAMBDA).
In AHC, the value function is computed by using an algorithm called TD(0). The critic AHC amd the reinforcement learning component (RL) work together and both recieve state information from the environment. Instead of acting to maximize instantaneous rewards, it acts to maximize the heuristic value v, that is computed by the critic. The critic uses the real external reinforcement signal to learn to map states to their expected discounted values given a certain policy being executed. The policy in RL is fixed and the AHC learns the value function for the policy. Then the critic is fixed and the RL component learns a new policy that maximizes the new function, etc.... The value of a policy is learned using Sutton's TD(0) algorithm which uses the update rule:
V(s) := V(s) + a(r + L V(s') - V(s))
Whenever a state s is visited, its estimated value is updated to be closer to r + L V(s').
Why is it better to have a critic than just to use the real rewards that are obtained directly? Its only after reading Barto, Sutton and Anderson (1983) that I realized that the role of the ACE is to construct predictions of reinforcement so that FOR EXAMPLE " IF PUNISHMENT IS LESS THAN ITS EXPECTED LEVEL, IT ACTS AS REWARD."
q-Learning
See the online example here. Q learning consists of the following
![]()
Imagine you have a state action matrix R, and an initial zero state action matrix Q. Q is updated by the existing reward obtained from that state + the maximum reward obtained from all actions taken from the next state discounted by some parameter Y. In the training phase one explores the space updating the Q values, and in the test phase one just follows the maximum Q value route to the goal.
This realy is remarkebly simple, except it is rather inefficient to have to take a random action in a training period.
Both these methods assign explicit values to explicit environmental states or state action pairs. How can this be done in a more realistic neural network form?
How to use AHC or Q-learning on-line with a real neural network task?
Comments on Henry Brighton's DPhil Thesis.
I am reading HBs thesis on the recommendation of S. Kerby in order to try to understand the history of language learning modeling. In particular grammar inferance.
1. I am concerned that the meaning signal mapping from one 2D space to another using the obverter procedure is a very metaphorical implementation of language. Why should compositionality correspond to the formation of a topological map between meaning and signal?
Bayesian Approaches to Grammar Inferance
The Bayseian formulation of the grammar inference problem is described here.
Software for Grammars
1. Simple Deterministic C++ FSM implementation
#include <stdio.h>
main()
{
int c;
START:
switch(c = getchar()){
case 'f' : goto F;
case 'b' : goto B;
case EOF : goto FAIL;
default: goto START; }
F:
switch(c = getchar()){
case 'o' : goto FO;
case EOF : goto FAIL;
default : goto START;}
FO:
switch(c = getchar()){
case 'o' : goto SUCCESS;
case EOF : goto FAIL;
default : goto START;}
B:
switch(c = getchar()){
case 'a' : goto BA;
case EOF : goto FAIL;
default : goto START;}
BA:
switch(c = getchar()){
case 'r' : goto SUCCESS;
case EOF : goto FAIL;
default : goto START;}
FAIL:
printf("Does not match.\n");
return;
SUCCESS:
printf("Matches.\n");
return;
}
NeuroEvo Films (large files for download)
1. 10th December 2007. Download zipped file here.
2. 10th December 2007. Oscillation period = 100000ms. Download zipped file here.
3. 10th December 2007. Oscillation period = 100000x10ms (as above). Tau_z increased from 1000 to 100000ms, i.e. 100 slower increase and decrease in average firing rate of neurons. Beta = 0.04 weight units per mHz difference. Goal rate = (20.0Hz/1000). Download zipped file here.
4. Just one round of copying. Input to bottom layer is strong random thalamic activation. Download zipped movie here.