Chrisantha Fernando

Webots Simulations

Introduction

Download Webots here, and play with my simulation of an actor-critic reinforcement learner that controls a Khepera robot performing phototaxis. You can download my simulation here. Just run the file in the World directory. In AC.cpp, the actor-critic controller file, you will need to change the line that gives the location of ActorCritic.cpp so that it points to the directory in which you have placed that file.

#include "/Users/ctf20/Documents/GECCO_2010/SC6_AC_RL_POP_NS/ActorCritic.cpp"

Please tell me if you have any problems running the simulation. Also, I'd be interested to know if you can improve the learning performance of the agent.

The yellow lines show the directions of the light sensors, the red lines the directions of the distance sensors. Figure 1 below shows an animation of the robot's behaviour after random initialization of the actor-critic reinforcement learning controller.

Figure 1a. Randomly initialized robot. Double click on the movie to play. You need a QuickTime plugin, and I sometimes need to reopen the page in Safari for it to work properly. Figure 1b below shows how the variables of the actor-critic change during the simulation. The robot's only behaviour to start with is clockwise rotation. Some actor weights increase, as do critic weights. The oscillating eligibilities are due to the two light sensors.

In the second graph below, we see the behaviour when the first action of the robot is to back into a corner. The robot gets stuck in the corner because no action produces a large enough change in TD-error to adjust the weights of the actor.
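To make the role of the TD-error concrete, here is a minimal sketch of a textbook TD actor-critic update with a linear critic. It is not taken from ActorCritic.cpp: the names `LinearCritic`, `tdError` and `updateActor`, the linear value function, and the parameters `alpha` and `gamma` are all assumptions made for illustration. Under these assumptions the actor's weight change is proportional to the TD-error, so a robot whose sensor readings and reward stay the same whatever it does sees a TD-error near zero and its actor weights stop moving.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical linear critic (an assumption, not the author's code):
// V(s) = w . x(s), where x is the sensor feature vector.
struct LinearCritic {
    std::vector<double> w;

    double value(const std::vector<double>& x) const {
        assert(x.size() == w.size());
        double v = 0.0;
        for (std::size_t i = 0; i < w.size(); ++i) v += w[i] * x[i];
        return v;
    }
};

// TD-error: delta = r + gamma * V(s') - V(s).
double tdError(const LinearCritic& critic, double reward, double gamma,
               const std::vector<double>& x,
               const std::vector<double>& xNext) {
    return reward + gamma * critic.value(xNext) - critic.value(x);
}

// Actor update scaled by the TD-error: w_i += alpha * delta * e_i,
// where e_i is the eligibility of weight i.
void updateActor(std::vector<double>& actorWeights,
                 const std::vector<double>& eligibility,
                 double alpha, double delta) {
    assert(actorWeights.size() == eligibility.size());
    for (std::size_t i = 0; i < actorWeights.size(); ++i)
        actorWeights[i] += alpha * delta * eligibility[i];
}
```

In the cornered case, successive feature vectors are identical and the critic has settled on the constant reward, so `tdError` returns a value close to zero and `updateActor` leaves the weights essentially unchanged, whichever action was taken.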

Another interesting pathology is that the behaviour of the AC may be EXCELLENT for many epochs, but if there is one epoch in which the robot experiences a new configuration of the light source, e.g. above an obstacle, then a radically different behaviour may be useful for achieving high fitness in that new configuration. The weights can change rapidly, and once this epoch is over, the previously learned behaviour that was useful in the more common configuration of light (i.e. in open space) is lost. One way in which we might expect multiple modules to be helpful is in the FREEZING of weights in a previously good module when it suddenly turns out to be associated with poor reward. The problem is that in the new configuration there may be very high rewards initially, but these should not cause the agent to modify the ORIGINAL strategy and lose previously learned habits. The environments that are dangerous to performance are those that provide high rewards in rare settings. This is the stability-plasticity dilemma. One solution is to reduce the learning rates so that brief periods of RARE high-reward events are not sufficient to destabilize longer periods of standard light-acquiring behaviour. However, there may be a way around this by using a multiple-controller architecture.

A rather brilliant strategy was for the robot to follow walls with its back and fix itself into the corners closest to the light. This resulted in very high fitness over several epochs. However, again, after a few strange epochs, this lovely strategy was lost.

You can see that the strategy that produces the consistently high value prediction is lost at around 4500 time units, and that it runs into problems in several epochs before this. If only the earlier excellent strategy could somehow have been preserved. See how nice and low the TD-error was for several epochs, and how nice and high the value was, e.g. from 2000 to 3000. What a strategy we lost. We can see example after example of well-learned strategies that are lost in unusual circumstances. Some videos will help to make this point.

The graph below shows a strategy for finding the light that is learned within the first few epochs. However, it proves inappropriate when the agent is behind the box: the agent gets stuck, yet keeps carrying out this useless strategy when it should perhaps give up, save that strategy for later and use another one. The question is, given its existing state, can the agent KNOW that it should switch strategies? In this case, the strategy could be regained after the strange environment was escaped. Between 1600 and 1700 we see the strategy fail. The video shows that this is because the robot was trapped behind a box. In this circumstance it would have been better to stop exploiting the previously useful strategy and to widen the search to a new set of strategies that might be useful in this context. In the movie the light is moved to behind the box, and the robot remains trapped there when the light is moved elsewhere. A strategy that WAS helpful is no longer helpful. In a sense, that strategy was a local optimum. In a sense, all strategies are local optima. And what is interesting is that a locally optimal strategy can be stored. DOWNLOAD THE VIDEO HERE. Actually, it is interesting to note that this strategy recovered and improved after this period of poor performance; see the attached VIDEO HERE of the behaviour about 50 epochs afterwards. We ask, will this excellent habit be lost?

In fact, a period of rapid "improvement" occurred shortly afterward, SEE VIDEO and graph. The robot tended to remain facing the light, rather than rotating around it. This was associated with a higher fitness IN THE SHORT TERM. However, when the light was moved, the robot was found to have lost the ability to rapidly TRACK to the light in the distance. Even so, it is interesting that this behaviour actually results in a higher fitness because of the highly directional properties of the light sensors: if the sensors can point accurately towards the light, the robot receives high fitness.

The graph below shows the improvement due to the newly discovered strategy.

Also, the behaviour persists for many epochs. It doesn't look as good, but it has a much higher fitness.

Note that the TD error is much lower on the whole, the accumulating reward (leaky integrated reward) is higher, the variance in predictions is much lower, and the total reward is higher. It seems that we have reached a ceiling in possible fitness with this strategy! It appears that this is not only a good strategy, but one that is stable with respect to itself, the environment, and the learning rules of the agent.
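For readers unfamiliar with the term, a leaky integrated reward can be sketched as a running trace that decays a little at every step and then adds the new reward. The form below, R <- (1 - leak) * R + r, is one common choice and is only an assumption here; the name `LeakyReward` and the `leak` parameter are hypothetical, and the simulation may use a different decay rule.

```cpp
#include <cassert>

// Hypothetical leaky integrator of reward (an illustrative assumption):
// trace <- (1 - leak) * trace + reward
struct LeakyReward {
    double leak;         // fraction of the trace lost per step, in (0, 1]
    double trace = 0.0;  // current leaky integrated reward

    explicit LeakyReward(double leakRate) : leak(leakRate) {
        assert(leak > 0.0 && leak <= 1.0);
    }

    // Decay the old trace, add the new reward, and return the result.
    double step(double reward) {
        trace = (1.0 - leak) * trace + reward;
        return trace;
    }
};
```

With a small `leak` the trace remembers many past steps, so it rises only when reward is sustained, which is why it serves as a smoother fitness signal than the instantaneous reward.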

Consider another randomly initialized run.

Run 1. The robot rotates without much improvement over epochs and eventually gets stuck in a corner. After this it establishes the now classical backwards wall-following strategy so often observed. However, even within the confines of this most resilient of strategies, performance can deteriorate for periods of several epochs due to some subtle cause unknown. One may be permitted to speculate for a minute: would it be possible to store, by some process of copying, or perhaps a process more rightly named replication, a strategy that was in the near future to become subtly flawed, through no misdemeanour on the part of the robot itself, but merely due to the variety of experience? A habit, once fragile and malleable, would be formed and cemented so as to be spared the ravages of continuous updating. For example, at time unit 7000 or thereabouts one observes an appalling demise in value and reward, a noisy time for predictions. There is of course not much weight change in these periods either. So perhaps storage itself would not have been necessary in these episodes. Storage becomes necessary and helpful where weight change destroys a perfectly good habit. A good habit can survive a period of noisy reward, as long as the reward is not HIGH. A reward that is HIGH for no self-induced virtue is most harmful even to the robotic mind in its most simple of states.

So as to leave the reader in no doubt that a period of high reward achieved by easy virtue is the most harmful of environments for a simple robot, I will tell the story of a robot that had developed the beautiful yet mediocre strategy of rotating over a light, and moving as quickly as possible to the light upon the light's departure. However, as one can see, at around time unit 1500 there is a sudden change in weight, corresponding to the robot getting stuck in a corner with the light. After this, it diverts to a rather ineffectual backwards wall following, but this time, unlike previous wall-following agents, it is not able to reliably establish the directional pointing of light sensors that makes this more daring strategy worthwhile. Despite large changes to the critic weights, the actor weights never experience much further change. Eventually the poor machine is stuck in a corner. All activity drives its wheels backwards as hard as possible into the corner. It is merely at the whim of the environment. No weight change occurs because no action is associated with increased performance. The preservation of a previous habit, almost irrespective of the quality of that previous habit, would have been better than this ignoble fate.

Thus it is incumbent upon us to devise a wily scheme, a devious and Tolpudalesque Siluvian Rabalade, a transmogrifying interlude in the hitherto uneventful evolutionary history of robot mind that once succumbed to the heinous stability-plasticity dilemma, with little memory of the dilemma despite much wrangling at the time. We institute a policy of neuronal natural selection, in its most base algorithmic forms, giving no heed to the rigors of Venetian neuronal reality.

The code for the simulations above is available here in a large zip file. The plan is deceptively simple. As a first attempt, let us store a copy, an exact copy for now, of the actor-critic controller at regular intervals in time. With each of these actor-critic copies is stored a record of how effective it was, i.e. the fitness (accumulated reward) associated with that unit when it was copied. It may be necessary to modify this exact measure, or even to make it a multi-dimensional record, a yearbook of that controller's trials and tribulations. Now, a stored controller is fixed: its weights are never changed, it is merely remembered. If it so happens that the current controller loses efficacy, e.g. if there is a noticeable drop in the cumulative reward of several 10000 units, then the existing controller is copied elsewhere and the active unit is overwritten with a fitter controller stored in the past. Thus, historical habits can come to the rescue, and be modified subsequently. It is thus possible to generate a phylogenetic behavioural tree, no less, and this is what we undertake below.
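The store-and-restore plan can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the zip file: `Controller`, `Snapshot`, `ControllerArchive`, `restoreIfDegraded` and the `dropThreshold` parameter are hypothetical names, the controller is reduced to two weight vectors, and "fitter" is taken to mean the stored copy with the highest recorded fitness.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the actor-critic controller: just its weights.
struct Controller {
    std::vector<double> actorWeights;
    std::vector<double> criticWeights;
};

// A frozen copy plus the fitness recorded at the moment it was copied.
struct Snapshot {
    Controller controller;
    double fitness;
};

class ControllerArchive {
public:
    // Store an exact copy of the active controller with its fitness.
    // Intended to be called at regular intervals in time.
    void store(const Controller& active, double fitness) {
        snapshots_.push_back(Snapshot{active, fitness});
    }

    // If the current fitness has fallen more than dropThreshold below the
    // best stored fitness, copy the active controller into the archive and
    // overwrite it with the fittest stored copy. Returns true on restore.
    bool restoreIfDegraded(Controller& active, double currentFitness,
                           double dropThreshold) {
        assert(dropThreshold >= 0.0);
        if (snapshots_.empty()) return false;
        std::size_t best = 0;
        for (std::size_t i = 1; i < snapshots_.size(); ++i)
            if (snapshots_[i].fitness > snapshots_[best].fitness) best = i;
        if (snapshots_[best].fitness - currentFitness <= dropThreshold)
            return false;
        // Copy first: store() may reallocate the vector of snapshots.
        Controller fitter = snapshots_[best].controller;
        store(active, currentFitness);
        active = fitter;
        return true;
    }

    std::size_t size() const { return snapshots_.size(); }

private:
    std::vector<Snapshot> snapshots_;
};
```

Stored snapshots are never modified, only copied back, so learning resumes from the restored weights while the displaced controller stays in the archive. Recording which snapshot each restored controller came from would give the parent links needed for a behavioural tree; that bookkeeping is left out of this sketch.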

The solution submitted to GECCO 2010.

A solution has been proposed, and submitted to GECCO 2010. The submitted paper is available here. The Webots code used to generate the data for this paper is available here. Just run one of the World files (2 or 3). The Mathematica file for analysing the data files is available here.

The basic principle is that copies are made of the actor-critic controller, which are then copied back to the active site if the current active controller is dysfunctional for one of the reasons described above. This is a very interesting possible role for neuronal replication.

©2005 Chrisantha Fernando