Download Webots here, and play with my simulation of an actor-critic reinforcement learner that controls a Khepera robot to perform phototaxis. You can download my simulation here. Just run the file in the World directory. In the file AC.cpp, which is the actor-critic controller file, you will need to change the line that specifies the path to ActorCritic.cpp so that it points to the directory in which you have placed that file.
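For example, the line in question might look something like the following (the variable name and path shown here are only illustrative placeholders, not the exact contents of AC.cpp):

    // Hypothetical example: edit this path to match your own installation.
    const char *ACTOR_CRITIC_PATH = "/home/yourname/webots/controllers/AC/ActorCritic.cpp";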
Please tell me if you have any problems running the simulation. Also, I'd be interested to know if you can improve the learning performance of the agent.
The yellow lines show the directions of the light sensors, and the red lines the directions of the distance sensors. Figure 1 below shows an animation of the behaviour of the robot with a randomly initialized actor-critic reinforcement learning controller.
Figure 1a. Randomly initialized robot. Double-click on the movie to play; you will need a QuickTime plugin, and sometimes I need to reopen the page in Safari for it to work properly. Figure 1b below shows how the variables of the actor-critic change during the simulation. The robot's only behaviour at the start is clockwise rotation. Some actor weights increase, as do critic weights. The oscillating eligibilities are due to the two light sensors.
In the second graph below, we see the behaviour when the first action of the robot is to back into a corner. The robot gets stuck there because no action results in sufficient change in TD-error to adjust the weights of the actor, a mechanism the sketch below makes concrete.
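To make the quantities in these plots concrete, here is a minimal sketch of an actor-critic update with eligibility traces, of the general kind this controller performs. The names, the two-sensor input, and all parameter values are illustrative assumptions, not the exact code in AC.cpp:

    const int N_SENSORS  = 2;     // e.g. the two light sensors
    const double GAMMA   = 0.95;  // discount factor (assumed)
    const double LAMBDA  = 0.9;   // eligibility-trace decay (assumed)
    const double ALPHA   = 0.1;   // critic learning rate (assumed)
    const double BETA    = 0.05;  // actor learning rate (assumed)

    double criticW[N_SENSORS] = {0};  // critic weights: value prediction
    double actorW[N_SENSORS]  = {0};  // actor weights: action preferences
    double elig[N_SENSORS]    = {0};  // eligibility traces

    // One TD update, given the sensor readings before and after acting
    // and the reward received in between. Returns the TD-error.
    double tdStep(const double xPrev[], const double xNow[], double reward) {
        double vPrev = 0.0, vNow = 0.0;
        for (int i = 0; i < N_SENSORS; ++i) {
            vPrev += criticW[i] * xPrev[i];
            vNow  += criticW[i] * xNow[i];
        }
        // TD-error: how much better or worse things went than predicted.
        double tdError = reward + GAMMA * vNow - vPrev;
        for (int i = 0; i < N_SENSORS; ++i) {
            // Decay each trace and mark the inputs that were just active.
            elig[i] = GAMMA * LAMBDA * elig[i] + xPrev[i];
            criticW[i] += ALPHA * tdError * elig[i];
            actorW[i]  += BETA  * tdError * elig[i];
        }
        return tdError;
    }

Note that when the TD-error stays at zero, both weight vectors are left untouched: this is exactly the corner trap described above, in which no available action changes the prediction error, and so no learning can occur.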
You can see that the strategy that produces the consistently high value prediction is lost at around 4500 time units, having already run into trouble in several epochs before this. If only the earlier excellent strategy could somehow have been preserved. See how nice and low the TD-error was for several epochs, and how nice and high the value was before that, e.g. from 2000 to 3000. What a strategy we lost. We see example after example of well-learned strategies that are lost in unusual circumstances. Some videos will help to make this point.
So as to leave the reader in no doubt that a period of high reward achieved by easy virtue is the most harmful of environments for a simple robot, I will tell the story of a robot who had developed the beautiful yet mediocre strategy of rotating over a light, and of moving as quickly as possible towards the light upon the light's departure. However, as one can see at around time unit 1500, there is a sudden change in weight, corresponding to the robot getting stuck in a corner with the light. After this, it diverts to a rather ineffectual backwards wall-following, but this time, unlike previous wall-following agents, it is not able to reliably establish the directional pointing of its light sensors that would make this more daring strategy worthwhile. Despite large changes to the critic weights, the actor weights never experience much further change. Eventually the poor machine is stuck in a corner. All activity drives its wheels backwards as hard as possible into the corner. It is merely at the whim of the environment. No weight change occurs because no action is associated with increased performance. The preservation of a previous habit, almost irrespective of the quality of that habit, would have been better than this ignoble fate.
Thus it is incumbent upon us to devise a wily scheme, a devious and Tolpudalesque Siluvian Rabalade, a transmogrifying interlude in the hitherto uneventful evolutionary history of the robot mind, one that overcame the heinous stability-plasticity dilemma with little memory of the dilemma despite much wrangling at the time. We institute a policy of neuronal natural selection, in its most basic algorithmic form, giving no heed to the rigours of Venetian neuronal reality.
The code for the simulations above is available here in a large zip file. The plan is undeceptively simple. As a first attempt, let us store a copy (an exact copy, for now) of the actor-critic controller at regular intervals in time. With each of these actor-critic copies is stored a record of how effective it was, i.e. the fitness (accumulated reward) associated with that unit when it was copied. It may be necessary to modify this exact measure, or even to make it a multi-dimensional record, a yearbook of that controller's trials and tribulations. Now, a stored controller is fixed: its weights are never changed; it is merely remembered. If it so happens that the current controller loses efficacy, e.g. if there is a noticeable drop of several tens of thousands of units in the cumulative reward, then the existing controller is copied elsewhere and the active unit is overwritten with a fitter controller stored in the past. Thus, historical habits can come to the rescue, and be modified subsequently. It is thus possible to generate a phylogenetic behavioural tree, no less, and this is what we undertake below.
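A minimal sketch of this snapshot-and-restore scheme, in the spirit of the description above, might look as follows. The structure names, the weight-vector size, and the drop threshold are all assumptions for illustration; the bookkeeping in the released code may differ:

    #include <vector>
    #include <cstring>

    const int N_WEIGHTS = 2;                // size of each weight vector (two sensors, as above)
    const double DROP_THRESHOLD = 10000.0;  // assumed reward drop that triggers a restore

    struct Snapshot {
        double actorW[N_WEIGHTS];
        double criticW[N_WEIGHTS];
        double fitness;   // accumulated reward at the time of copying
    };

    std::vector<Snapshot> archive;   // the record of past controllers

    // Store an exact, frozen copy of the active controller.
    void storeSnapshot(const double actorW[], const double criticW[], double fitness) {
        Snapshot s;
        std::memcpy(s.actorW, actorW, sizeof(s.actorW));
        std::memcpy(s.criticW, criticW, sizeof(s.criticW));
        s.fitness = fitness;
        archive.push_back(s);
    }

    // Overwrite the active controller with the fittest stored ancestor.
    // Called when cumulative reward has dropped noticeably.
    void restoreFittest(double actorW[], double criticW[]) {
        if (archive.empty()) return;
        const Snapshot *best = &archive[0];
        for (size_t i = 1; i < archive.size(); ++i)
            if (archive[i].fitness > best->fitness)
                best = &archive[i];
        std::memcpy(actorW, best->actorW, sizeof(best->actorW));
        std::memcpy(criticW, best->criticW, sizeof(best->criticW));
    }

Because stored snapshots are never modified, each restore starts a fresh branch of learning from a remembered ancestor, which is what makes the phylogenetic behavioural tree mentioned above possible.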
The solution submitted to GECCO 2010.
A solution has been proposed and submitted to GECCO 2010. The submitted paper is available here. The Webots code used to generate the data for this paper is available here. Just run one of the World files (2 or 3). The Mathematica file for analysing the data files is available here.
The basic principle is that copies are made of the actor-critic controller, and these are copied back to the active site if the currently active controller becomes dysfunctional for one of the reasons described above. This is a very interesting possible role for neuronal replication.
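Within the robot's control loop, the copy-back decision might reduce to something like the following sketch, reusing the helpers from the snapshot sketch above; the copying interval and the windowed-reward comparison are again assumptions rather than the paper's exact criterion:

    const long SNAPSHOT_INTERVAL = 5000;  // assumed copying interval, in time steps

    // Decision taken each control step: snapshot periodically, and restore
    // a fitter ancestor if reward over the last window has collapsed.
    void maybeCopyOrRestore(long t, double windowReward, double prevWindowReward,
                            double actorW[], double criticW[]) {
        if (t % SNAPSHOT_INTERVAL == 0)
            storeSnapshot(actorW, criticW, windowReward);
        if (windowReward < prevWindowReward - DROP_THRESHOLD)
            restoreFittest(actorW, criticW);
    }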