One of the papers we are presenting at this year's Computational Intelligence and Games symposium is about comparing td-learning and neuroevolution for learning car racing skills.
The paper (go read it!) contains lots of material, and I won't try to summarize the rather dense methods and results sections here. But let me reflect a bit on some of the major conclusions:
First of all, td-learning can be blazing fast when it works like it should. Of course td-learning could potentially be faster than evolution, as it learns from feedback during the lifetime of an individual, but we didn't expect it to be quite as fast as it sometimes was. A few times we saw a car controller start from tabula rasa and progress to driving decently between waypoints within a few hundred time steps, maybe 20 seconds of simulated time!
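To see why learning can be that fast, remember that td-learning gets a training signal at every single time step, rather than once per completed trial. Here is a minimal, purely illustrative sketch of a TD(0) value update in Python; the real experiments use neural network function approximation over car states, and the parameter values below are made up:

    # Illustrative TD(0) update: every time step (state, reward, next state)
    # nudges the value estimate, so learning can happen within one "lifetime".
    # This is a simplified stand-in, not the exact setup from the paper.
    alpha = 0.1   # learning rate (illustrative)
    gamma = 0.95  # discount factor (illustrative)

    def td0_update(values, state, reward, next_state):
        """One temporal-difference update on a simple table of state values."""
        td_error = reward + gamma * values[next_state] - values[state]
        values[state] += alpha * td_error
        return td_error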
But to balance the picture, td-learning can be a bitch. Really. Performance is completely unpredictable: the very same parameter configuration gives completely different results in successive runs, learning very well sometimes and not at all at other times. Often, good behaviour that has already been learned is unlearned again after a few more epochs. Etc, etc. It is simply much easier to learn something sensible with evolution than with td-learning. And in the end, the best evolved controllers are consistently better than the best td-learned controllers.
Which brings us to the question of whether these effects are inherent to the algorithms, or whether they are an artifact of Simon and me being much more familiar with evolution than with td-learning. Interesting question. I don't know. We did, however, bring in Thomas Runarsson to help us with the experiments, and he's done quite a bit of td-learning in the past.
Another interesting thing that came out of our experiments is how good it is to have a forward model available. Evolving state value functions, used together with the model to choose actions, consistently outperformed direct control. I think the use of forward models might very well be the next big thing in evolutionary robotics. We have a couple of exciting ideas for how to do this; now we just need time to get working on that...
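To make the idea concrete, here is a rough sketch (in Python, with hypothetical function names, not the actual simulator API) of how a state value function can be used for control once a forward model is available: each candidate action is simulated one step ahead with the model, and the action leading to the highest-valued predicted state is chosen.

    # Hedged sketch: action selection with a forward model and a state value
    # function. `forward_model(state, action)` and `value_of(state)` are
    # hypothetical stand-ins for whatever model and evolved value function
    # are actually used.
    def choose_action(state, actions, forward_model, value_of):
        """Pick the action whose predicted successor state has the highest value."""
        return max(actions, key=lambda a: value_of(forward_model(state, a)))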
Anyway, that's all for today. Slightly more unstructured than usual. But so am I.
The car racing competition is now updated, with minor changes to the code, submission instructions, and a hall of fame:
http://julian.togelius.com/cig2007competition/
Please consider participating. It's easy to get started, and really fun! (Hah, like I was ever going to tell you it's hard and boring, even if it was... Really, I've tried hard to make it as easy as possible to enter.)
If you decide to give it a go, I would appreciate an email stating this intention, so I can put your address on a list of people to notify in the unlikely event of changes to the rules, code, etc.
Imagine you were driving a car in a long, dark tunnel, and suddenly your headlights started flickering, going off and on irregularly, at intervals of a second or so. What would you do? It seems the only way you could keep from crashing would be to accurately remember the bends of the tunnel. For example, if the lights went out just before a left turn, you would have to predict how long until the turn starts and begin turning appropriately.
Now imagine you were driving a radio-controlled car, but due to some low-grade engineering, there was an unfortunate delay between when you issue a command (such as turning left) and when the command has an effect (the wheels actually angling). How would you handle this? It seems you would have to predict the effects of your turning, so that you started and stopped turning slightly before you seemed to need to.
These two situations are the inspiration for a paper we (Hugo, Magdalena and I) are presenting at the 2007 IEEE-ALife Symposium in Hawaii. Essentially, we wanted to see whether we could force our controllers to learn to predict. Of course, we used my good old car racing simulator for the experiments. To remind you, this is what one of our evolved controllers looks like when all six sensors are turned on and up to date: (The strange lines represent the sensors)
Now, let's turn off the sensors intermittently and see what happens: (No lines = no sensors)
Not very pretty. Can we improve on this? We tried, by recording the car driving around a few tracks and teaching neural networks to predict the future (what sensor input comes next, given the current input and the action taken). First, we used backpropagation for this. Combining such a predictor with the same evolved controller as before looks like this:
Better than before, but not much.
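For the curious, this is roughly what the supervised predictor amounts to: a small neural network trained with backpropagation on recorded (sensors, action) -> next-sensors pairs. The sketch below is an assumption-laden illustration; the layer sizes, learning rate and exact input encoding are not the paper's values:

    import numpy as np

    # Rough sketch of a backprop-trained predictor: an MLP mapping the current
    # six sensor readings plus the action to the next sensor readings.
    # All sizes and the learning rate are illustrative assumptions.
    rng = np.random.default_rng(0)
    n_in, n_hidden, n_out = 6 + 2, 10, 6   # 6 sensors + 2 action components -> 6 sensors
    W1 = rng.normal(0, 0.1, (n_in, n_hidden))
    W2 = rng.normal(0, 0.1, (n_hidden, n_out))
    lr = 0.01

    def train_step(x, target):
        """One backpropagation step on a single (state+action, next state) pair."""
        global W1, W2
        h = np.tanh(x @ W1)                              # hidden activations
        pred = h @ W2                                    # predicted next sensor state
        err = pred - target                              # gradient of 0.5*squared error w.r.t. pred
        grad_W2 = np.outer(h, err)
        grad_W1 = np.outer(x, (W2 @ err) * (1 - h**2))   # tanh derivative
        W2 -= lr * grad_W2
        W1 -= lr * grad_W1
        return float(np.mean(err**2))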
So we tried another thing. Instead of training the predictor networks to predict accurately, we evolved them for how well they helped the controller drive. It might at first not seem like much of a difference, but in fact it is crucial. Look for yourselves:
Clearly much better. And the difference turns out to be not only quantitative, but also qualitative. But before we go into the analysis, let's look at the other task: the delay task. Below is the same good old evolved controller as in the examples above, but with all sensor inputs delayed by three time steps:
Looks like the driver is drunk, doesn't it?
Let's see if we can do something about this. First, we try to predict the current sensory state from the outdated perceptions, using a predictor trained with backpropagation. We then get something like this:
Pretty terrible. The driver went from drunk to stoned.
The next step was to instead evolve a predictor for maximum performance, as we did with the intermittent task above. Again, the result is strikingly different:
So, what's the take-home message from this? That evolution works better than backpropagation for learning predictors? Not so simple. Because when we analyse the various evolved and trained predictors, it turns out that the evolved predictors don't actually do any prediction! In other words, the mean squared error between the predicted next state and the real next state is quite low for the trained predictors, but horribly high for the evolved ones!
So, again, what does this mean? For one thing, the type of neural networks and the data we are using (only one prior state and action) are not enough to predict the next state as accurately as we would have needed. Therefore the predictors we got with supervised learning were not up to the task. Evolution, on the other hand, quickly figures out that accurate prediction is impossible and goes for something else instead. The evolved predictors act as extensions of the controller, changing its behaviour so that it copes better with the missing or delayed data. These changes might include slower driving, a higher propensity for turning one way rather than the other, or making sure that when bumping into walls, the back end of the car goes first rather than the front.
At least, this is what we think happens. Let's say that the topic merits further study... please read the paper if you're interested.
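To make the contrast with the backprop predictor concrete, here is a sketch of the evolutionary alternative: the predictor's weights are selected on how well the car drives with that predictor wired in front of the fixed, previously evolved controller, not on how well it predicts. A simple (1+1) evolution strategy is shown; the actual algorithm in the paper and the driving_fitness function are stand-ins.

    import numpy as np

    # Sketch: evolving predictor weights for driving performance rather than
    # prediction accuracy. `driving_fitness(weights)` is a hypothetical
    # function that runs the car with the predictor (using these weights)
    # feeding the fixed controller, and returns progress along the track.
    def evolve_predictor(driving_fitness, n_weights, generations=500, sigma=0.1):
        rng = np.random.default_rng(0)
        parent = rng.normal(0, 0.1, n_weights)
        parent_fit = driving_fitness(parent)
        for _ in range(generations):
            child = parent + rng.normal(0, sigma, n_weights)
            child_fit = driving_fitness(child)
            if child_fit >= parent_fit:   # keep the mutant if it drives at least as well
                parent, parent_fit = child, child_fit
        return parent, parent_fit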
I'm not so sure if any of the above made much sense to you, dear reader. Is my habit of trying to summarise the main points of whole papers a good one? Or does it all just become compressed to the point of unintelligibility? Tell me!