Monday, August 03, 2020

A very short history of some times we solved AI

1956: Logic Theorist. Arguably, pure mathematics is the crowning achievement of human thought. Now we have a machine that can prove new mathematical theorems as well as a human. It has even proven 38 of the first 52 theorems of Principia Mathematica on its own, and one of the proofs is more elegant than what Russell and Whitehead had come up with. It is inconceivable that anyone could have this mathematical ability without being highly intelligent.

1991: Karl Sims' Creatures. Evolution is the process that created natural intelligence. Now we can harness evolution to create creatures inhabiting a simulated virtual world with realistic physics. These evolved creatures have already developed new movement patterns that are more effective than any human-designed movements, and we have seen an incredible array of body shapes, many unexpected. There is no limit to the intelligence that can be developed by this process; in principle, these creatures could become as intelligent as us, if they just keep evolving.

1997: Deep Blue. Since antiquity, Chess has been seen as the epitome of a task the requires intelligence. Not only do you need to do long-term planning in a complex environment with literally millions of possibilities, but you also need to understand your adversary and take their playing style into account so that you can outsmart them. No wonder that people are good at Chess are generally quite intelligent. In fact, it seems impossible to be good at something as complex as Chess without being intelligent. And now we have a computer that can beat the world champion of Chess!

2016: AlphaGo. Go, the Asian board game, is in several ways a much harder challenge than Chess. There are more moves to choose from, and recognizing a good board state is a very complex task in its own right. Computers can now play Go better than the best human player, and a newer version of this algorithm can also be taught play Chess (after some tweaks). This astonishing flexibility suggests that it could be taught to do basically anything.

2019: GPT-2. Our language is our most important and impactful invention, and arguably what we use to structure and shape our thoughts. Maybe it's what makes thinking as we know it possible. We now have a system that, when prompted with small snippets of text, can produce long and shockingly coherent masses of text on almost any subject in virtually any style. Much of what it produces could have been written by a human, and you have to look closely to see where it breaks down. It really does seem like intelligence.

2020: GPT-3. Our language is our most important and impactful invention, and arguably what we use to structure and shape our thoughts. Maybe it's what makes thinking as we know it possible. We now have a system that, when prompted with small snippets of text, can produce long and shockingly coherent masses of text on almost any subject in virtually any style. Much of what it produces could have been written by a human, and you have to look closely to see where it breaks down. It really does seem like intelligence.

This is obviously a very selective list, and I could easily find a handful more examples of when we solved the most important challenge for artificial intelligence and created software systems that were truly intelligent. These were all moments that changed everything, after which nothing would ever be the same. Because we made the machine do something that everyone agreed required true intelligence, the writing was on the wall for human cognitive superiority. We've been prognosticating the imminent arrival of our new AI overlords since at least the 50s.

Beyond the sarcasm, what is it I want to say with this?

To begin with, something about crying wolf. If we (AI researchers) keep bringing up the specter of Strong AI or Artificial General Intelligence every time we have a new breakthrough, people will just stop taking us seriously. (You may or may not think it is a bad thing that people stop taking AI researchers seriously.)

Another point is that all of these breakthroughs really were worth the attention they were getting at the time. They really were major advances that changed things, and they all brought unexpected performance to tasks that we thought we needed "real" intelligence to perform. And there were many other breakthroughs in AI that could have fit onto this list. These were really just the first five things I could think of.

But we no longer worry that the Logic Theorist or Deep Blue is going to take over the world, or even put us out of jobs. And this is presumably not because humans have gotten much smarter in the meantime. What happened was that we learned to take these new abilities for granted. Algorithms for search, optimization, and learning that were once causing headlines about how humanity was about to be overtaken by machines are now powering our productivity software. And games, phone apps, and cars. Now that the technology works reliably, it's no longer AI (it's also a bit boring).

In what has been called "the moving goalpost problem", whenever we manage to build an AI system that solves (or does really well at) some task we thought was essential for intelligence, this is then taken to demonstrate that you did not really need to be intelligent to solve this task after all. So the goalpost moves, and some other hard task is selected as our next target. Again and again. This is not really a problem, because it teaches us something about the tasks our machines just mastered. Such as whether they require real intelligence.

So when will we get to real general artificial intelligence? Probably never. Because we're chasing a cloud, which looks solid from a distance but scatters in all directions as we drive into it. There is probably no such thing as general intelligence. There's just a bunch of strategies for solving various "cognitive" problem, and these strategies use various parts of the same hardware (brain, in our case). The problems exist in a world we mostly built for ourselves (both our culture and our built environment), and we built the world so that we would be effective in it. Because we like to feel smart. But there is almost certainly an astronomical number of potential "cognitive" problems we have no strategies for, have not encountered, and which our brain-hardware might be very bad at. We are not generally intelligent.

The history of AI, then, can be seen as a prolonged deconstruction of our concept of intelligence. As such, it is extremely valuable. I think we have learned much more about what intelligence is(n't) from AI than we have from psychology. As a bonus, we also get useful technology. In this context, GPT-3 rids us from yet another misconception of intelligence (that you need to be generally intelligent to produce surface-level coherent text) and gives us a new technology (surface-level coherent text on tap).

Lest someone misunderstand me, let me just point out that I am not saying that we could not replicate the same intelligence as a human has in a computer. It seems very likely that we could in the future build a computer system which has approximately the same set of capabilities as a human. Whether we would want to is another matter. This would probably be a very complex system with lots of parts that don't really play well together, just like our brain, and very hard to fine-tune. And the benefits of building such a system would be questionable, as it would not necessarily be any more or less "generally intelligent" than many other systems we could build that perform actual tasks for us. Simply put, it might not be cost-efficient. But maybe we'll build one anyway, for religious purposes or something like that.

Until then, there are lots of interesting specific problems to solve!

Saturday, July 21, 2018


I've been to both the IEEE Congress on Evolutionary Computation (CEC) and IEEE Genetic and Evolutionary Computation Conference (GECCO) many times now, but this year was probably the first time that I attended both of the two major evolutionary computation conferences back to back. This gave me an opportunity to think about their differences and respective strengths and weaknesses.

To begin with, both conferences feature some very good work and quality of top papers at both is comparable. However, the average paper quality at GECCO is higher. This is almost certainly because CEC has much higher acceptance rate. I'm not a fan of artificially low acceptance rates, as I think it discourages risk-taking and all good research deserves being published. However, I think not all papers at CEC deserve to be full papers with oral presentation. There's just too much noise.

Both conferences have invited talks (called keynotes and plenary talks). However, they differ in their character. Whereas CEC largely invites prominent speakers from within the community, GECCO seems to almost entirely source their speakers from outside the community. I've often been puzzled by the choice of keynote speakers at GECCO, but this year was extreme. The speakers had almost nothing to do with evolutionary computation. I understand that it's good with outside influences, but this felt like random talks on random topics. A research community also has a responsibility to help its researchers grow by giving strong researchers an opportunity to shine, and present them as examples to the community. It is my strong opinion that CEC has a much better keynote selection policy than GECCO. (Yes, I'm biased as I gave one of the CEC keynotes this year. But I also enjoyed the other CEC keynotes way more than the GECCO keynotes.)

 CEC has a number of special sessions whereas GECCO has tracks. I think the GECCO model is somewhat better than the CEC model here. The tracks have more of their own identity, and review and paper selection happens on a per-track basis, which is nice. CEC could easily turn the special sessions into something more like tracks, which would probably be a good thing. However, the difference is not large.  (Aitor Arrieta on Twitter points out that it's nice to be able to have special sessions on hot topics, which is true - tracks are a bit less flexible.)

Then there's the best paper award selection policy. Here GECCO is a clear winner, with awards in each track, and the best paper selected by public vote among a handful of top-reviewed papers. This is infinitely much fairer and more transparent than CEC's "selection by secret cabal". CEC, please fix this problem.

Finally, why are there two main conferences on evolutionary computation? Turns out it's for historical reasons, that at least partly have to do with animosity between certain influential people who are no longer that important in the community. I'm not necessarily a fan of always having a single large conference, but especially for US researchers your papers count more if published in a "large selective" conference. With this in mind, I think CEC and GECCO should merge.

(This blog post is edited from a series of tweets. I'm thinking about doing this more often, as blog posts are perceived as more permanent than tweets.)

Sunday, May 27, 2018

Empiricism and the limits of gradient descent

This post is actually about artificial intelligence, and argues a position that many AI researchers will disagree with. Specifically, it argues that the method underlying most of deep learning has severe limitations which another, much less popular method can overcome. But let's start with talking about epistemology, the branch of philosophy which is concerned with how we know things. Then we'll get back to AI.

Be warned: this post contains serious simplifications of complex philosophical concepts and arguments. If you are a philosopher, please do not kill me for this. Even if you are not a philosopher, just hear me out, OK?

In the empiricist tradition in epistemology, we get knowledge from the senses. In the 17th century, John Locke postulated that the mind is like a blank slate, and the only way which we can get knowledge is through sense impressions: these impressions figuratively write our experience onto this blank slate. In other words, what we perceive through our eyes, ears and other sense organs causes knowledge to be formed and accumulated within us.

The empiricist tradition of thought has been very influential for the last few centuries, and philosophers such as Hume, Mill and Berkeley contributed to the development of empiricist epistemology. These thinkers shared the conviction that knowledge comes to us through experiencing the world outside of us through our sense. They differed in what they thought we can directly experience - for example, Hume though we can not experience causality directly, only sequences of world-states - and exactly how the sense impressions create knowledge, but they agree that the sense impressions are what creates knowledge.

In the 20th century, many philosophers wanted to explain how the (natural) sciences could be so successful, and what set the scientific mode of acquiring knowledge apart from superstition. Many of them were empiricists. In particular, the Vienna Circle, a group of philosophers, mathematicians, and physicists inspired by the early work of Wittgenstein, articulated a philosophy that came to be known as Logical Empiricism. The basic idea is that sense impressions is all there is, and that all meaningful statements are complex expressions that can be analyzed down to their constituent statements about sense impressions. We gain knowledge through a process known as induction, where we generalize from our sense impressions. For example, after seeing a number of swans that are white you can induce that swans are white.

A philosopher that was peripheral to the Vienna Circle but later became a major figure in epistemology in his own right was Karl Popper. Popper shared the logical empiricists' zeal for explaining how scientific knowledge was produced, but differed radically in where he thought knowledge came from. According to Popper, facts do not come from sense impressions. Instead, they come "from within": we formulate hypotheses, meaning educated guesses, about the world. These hypotheses are then tested against our sense impressions. So, if we hypothesize that swans are white, we can then check this with what our eyes tell us. Importantly, we should try to falsify our hypotheses, not to verify them. If the hypothesis is that swans are white, we should go looking for black swans, because finding one would falsify our hypothesis. This can be easily motivated with that if we already think swans are white, we're not getting much new information by seeing lots of white swans, but seeing a black swan (or trying hard but failing to find a black swan) would give us more new information.

Popper called his school of thought "critical rationalism". This connects to the long tradition of rationalist epistemology, which just like empiricist epistemology has been around for most of the history of philosophy.  For example, Descartes' "I think, therefore I am" is a prime example of knowledge which does not originate in the senses.

Among (natural) scientists with a philosophical bent, Popper is extremely popular. Few modern scientists would describe themselves as logical empiricists, but many would describe themselves as critical rationalists. The main reason for this is that Popper describes ways of successfully creating scientific knowledge, and the logical empiricists do not. To start with the simple case, if you want to arrive at the truth about the color of swans, induction is never going to get you there. You can look at 999999 white swans and conclude that they are all white, but the millionth may be black. So there can be no certainty. With Popper's hypothetico-deductive method you'd make a hypothesis about the whiteness of swans, and then go out and actively try to find non-white swans. There's never any claim of certainty, just of an hypothesis having survived many tests.

More importantly, though, the logical empiricist story suffers from the problem that more complex facts are simply not in the data. F=ma and E=mc2 are not in the data. However many times you measure forces, masses and accelerations of things, the idea that the force equals mass times acceleration is not going to simply present itself. The theories that are at the core of our knowledge cannot be discovered in the data. They have to be invented, and then tested against the data. And this is not confined to large, world-changing theories.

If I already have the concepts of swan, white and black at the ready, I can use induction to arrive at the idea that all swans are white. But first I need to invent these concepts. I need to decide that there is such a thing as a swan. Inductivists such as Hume would argue that this could happen through observing that "a bundle of sense impressions" tend to co-occur whenever we see a swan. But a concept such a swan is actually a theory: that the animal is the same whether it's walking of flying, that it doesn't radically change its shape or color, and so on. This theory needs to somehow be invented, and then tested against observation.

In other words, empiricism is at best a very partial account of how we get knowledge. On its own, it can't explain how we arrive at complex concepts or theories, and it does not deliver certainty. Perhaps most importantly, the way we humans actually do science (and other kinds of advanced knowledge production) is much more like critical rationalism than like empiricism. We come up with theories, and we work to confirm of falsify them. Few scientists just sit around and observe all day.

Enough about epistemology for now. I promised you I would talk about artificial intelligence, and now I will.

Underlying most work in neural networks and deep learning (the two terms are currently more or less synonymous) is the idea of stochastic gradient descent, in particular as implemented in the backpropagation algorithm. The basic idea is that you can learn to map inputs to outputs through feeding the inputs to the network, seeing what comes out at the other hand, and compare it with the correct answer. You then adjust all the connection weights in the neural network so as to bring the output closer to the correct output. This process, which has to be done over and over again, can be seen as descending the error gradient, thus the name gradient descent. You can also think of this as the reward signal pushing around the model, repelling it whenever it does something bad.

(How do you know the correct output? In supervised learning, you have a training set with lots of inputs (e.g. pictures of faces) and corresponding outputs (e.g. the names of the people in the pictures). In reinforcement learning it is more complex, as the input is what an agent sees of the world, and the "correct" output is typically some combination of the actual reward the agent gets and the model's own estimate of the reward.)

Another type of learning algorithm that can be used for both supervised learning and reinforcement learning (and many other things as well) is evolutionary algorithms. This is a family of algorithms based on mimicking Darwinian evolution by natural selection; algorithms in this family include evolution strategies and genetic algorithms. When using evolution to train a neural net, you keep a population of different neural nets and test them on whatever task they are supposed to perform, such as recognizing faces or playing a game. Every generation, you throw out the worst-performing nets, and replace them with "offspring" of the better-performing neural nets; essentially, you make copies and combinations of the better nets and apply small perturbations ("mutations") to them. Eventually, these networks learn to perform their tasks well.

So we have two types of algorithms that can both be used for performing both supervised learning and reinforcement learning (among other things). How do they measure up?

To begin with, some people wonder how evolutionary algorithms could work at all. It is perhaps important to point out here that evolutionary algorithms are not random search. While randomness is used to create new individuals (models) from old ones, fitness-based selection is necessary for these algorithms to work. Even a simple evolution strategy, which can be implemented in ten or so lines of code, can solve many problems well. Additionally, decades of development of the core idea of evolution as a learning and search strategy has resulted in many more sophisticated algorithms, including algorithms that base the generation of new models on adaptive models of the search space, algorithms that handle multiple objectives, and algorithms that find diverse sets of solutions.

Gradient descent is currently much more popular than evolution in the machine learning community. In fact, many machine learning researchers do not even take evolutionary algorithms seriously. The main reason for this is probably the widespread belief that evolutionary algorithms are very inefficient compared to gradient descent. This is because evolutionary algorithms seem to make use of less information than gradient descent does. Instead of incorporating feedback every time a reward is found in a reinforcement learning problem, in a typical evolutionary algorithm only the end result of an episode is taken into. For example, when learning to play Super Mario Bros, you could easily tell a gradient descent-based algorithm (such as Q-learning) to update every time Mario picks up a coin or gets hurt, whereas with an evolutionary algorithm you would usually just look at how far Mario got along the level and use that as feedback.

Another way in which evolution uses less information than gradient descent is that the changes to the network are not necessarily done so as to minimize the error, or in general to make the network as good as possible. Instead, the changes are generally completely random. This strikes many as terribly wasteful. If you have a gradient, why not use it?

(Additionally, some people seem to dislike evolutionary computation because it is too simple and mathematically uninteresting. It is true that you can't prove many useful theorems about evolutionary algorithms. But come on, that's not a serious argument against evolutionary algorithms, more like a prejudice.)

So is the idea that evolutionary algorithms learn less efficiently than gradient descent supported by empirical evidence? Yes and maybe. There is no question that the most impressive results coming out of deep learning research are all built on gradient descent. And for supervised learning, I have not seen any evidence that evolution achieves anything like the same sample-efficiency as gradient descent. In reinforcement learning, most of the high-profile results rely on gradient descent, but they also rely on enormous computational resources. For some reinforcement learning problems which can be solved with small networks, evolutionary algorithms perform much better than any gradient descent-based methods. They also perform surprisingly well on playing Atari games from high-dimensional visual input (which requires large, deep networks) and are the state of the art on the MuJoCo simulated robot control task.

Does evolutionary algorithms have any advantage over gradient descent? Yes. To begin with, you can use them even in cases where you cannot calculate a gradient, i.e. your error function is not differentiable. You cannot directly learn program code or graph structures with gradient descent (though there are indirect ways of doing it) but that's easy for evolutionary algorithms. However, that's not the angle I wanted to take here. Instead I wanted to reconnect to the discussion of epistemology this post started with.

Here's my claim: learning by gradient descent is an implementation of empiricist induction, whereas evolutionary computation is much closer to the hypothetico-deductive process of Popper's critical rationalism. Therefore, learning by gradient descent suffers from the same kind of limitations as the empiricist view of knowledge acquisition does, and there are things that evolutionary computation can learn but gradient descent probably cannot.

How are those philosophical concepts similar to these algorithms? In gradient descent, you are performing frequent updates in the direction that minimizes error. The error signal can be seen as causal: when there is an error, that error causes the model to change in a particular way. This is the same process as when a new observation causes a change in a person's belief ("writing our experience on the blank slate of the mind"), within the empiricist model of induction. These updates are frequent, making sure that every signal has a distinct impression on the model (batch learning is often used with gradient descent, but generally seen as a necessary evil). In contrast, in evolutionary computation, the change in the model is not directly caused by the error signal. The change is stochastic, not directly dependent on the error and not in general in the direction that minimizes the error, and in general much less common. Thus the model can be seen as a hypothesis, which is tested through applying the fitness function. Models are generated not from the data, but from previous hypotheses and random changes; they are evaluated by testing their consequences using the fitness function. If they are good, they stay in the population and more hypotheses are generated from them; if they are bad, they die.

How about explicitly trying to falsify the hypothesis? This is a key part of the Popperian mode of knowledge acquisition, but it does not seem to be part of evolutionary computation per se. However, it is part of competitive coevolution. In competitive coevolution, two or more populations are kept, and the fitness of the individuals in one population are dependent on how well they are perform against individuals in the other population. For example, one population could contain predators and the other prey, or one could contain image generators and the other image recognizers. As far as I know, the first successful example of competitive coevolution was demonstrated in 1990; the core idea was later re-invented (though with gradient descent instead of evolutionary search) in 2014 as generative adversarial networks.

If you accept the idea that learning by gradient descent is fundamentally a form of induction as described by empiricists, and that evolutionary computation is fundamentally more like the hypothetico-deductive process of Popperian critical rationalism, where does this take us? Does it say anything about what these types of algorithms can and cannot do?

I believe so. I think that certain things are extremely unlikely to ever be learned by gradient descent. To take an obvious example, I have a hard time seeing gradient descent ever learning F=ma or E=mc2. It's just not in the data - it has to be invented. And before you reply that you have a hard time seeing how evolution could learn such a complex law, note that using evolutionary computation to discover natural laws of a similar complexity has been demonstrated almost a decade ago. In this case, the representation (mathematical expressions represented as trees) is distinctly non-differentiable, so could not even in principle be learned through gradient descent. I also think that evolutionary algorithms, working by fewer and bolder strokes rather than a million tiny steps, is more likely to learn all kinds of abstract concepts. Perhaps the area where this is likely to be most important is reinforcement learning, where allowing the reward to push the model around seems to not be a very good idea in general and testing and discarding complete strategies may be much better.

So what should we do? Combine multiple types of learning of course! There are already hundreds (or perhaps thousands) of researchers working on evolutionary computation, but for historical reasons the evolutionary computation community is rather dissociated from the community of researchers working on machine learning by gradient descent. Crossover between evolutionary learning and gradient descent yielded important insights three decades ago, and I think there is so much more to learn. When you think about it, our own intelligence is a combination of evolutionary learning and lifetime learning, and it makes sense to build artificial intelligence on similar principles.

I am not saying gradient descent is a dead end nor that it will necessarily be superseded. Backpropagation is obviously a tremendously useful algorithm and gradient descent a very powerful idea. I am also not saying that evolutionary algorithms are the best solution for everything - they very clearly are not (though some have suggested that they are the second best solution for everything). But I am saying that backpropagation is by necessity only part of the solution to the problem of creating learning machines, as it is fundamentally limited to performing induction, which is not how real discoveries are made.

Some more reading: Kenneth Stanley has though a lot about the advantages of evolution in learning, and he and his team has written some very insightful things about this. Several large AI labs have teams working on evolutionary deep learning, including Uber AI, Sentient Technologies, DeepMind, and OpenAI. Gary Marcus has recently discussed the virtues of "innateness" (learning on evolutionary timescales) in machine learning. I have worked extensively with evolutionary computation in game contexts, such as for playing games and generating content for games. Nine years ago, me and a perhaps surprising set of authors set out to briefly characterize the differences between phylogenetic (evolutionary) and ontogenetic (gradient descent-based) reinforcement learning. I don't think we got to the core of the matter back then - this blog post summarizes a lot of what I was thinking but did not know how to express properly then. Thanks to several dead philosophers for helping me express my thoughts better. There's clearly more serious thinking to be done about this problem.

I'm thinking about turning this blog post into a proper paper at some point, so feedback of all kinds is welcome.

Saturday, October 28, 2017

IEEE Transactions on Games, your new favorite journal for games research

At the start of 2018, I will officially become the Editor-in-Chief of the IEEE Transactions on Games (ToG). What is this, a new journal? Not quite: it is the continuation of the IEEE Transactions on Computational Intelligence and AI in Games (TCIAIG, which has been around since 2009), but with a shorter name and much wider scope.

This means that I will have the honor of taking over from Simon Lucas, who created TCIAIG and served as its inaugural Editor-in-Chief, and Graham Kendall, who took over from Simon. Under their leadership, TCIAIG has become the most prestigious journal for publishing work on artificial intelligence and games.

However, there is plenty of interesting work on games, with games or using games, which is not in artificial intelligence. Wouldn't it be great if we had a top-quality journal, especially one with the prestige of an IEEE Transactions, where such research could be published? This is exactly the thought behind the transformed journal. The scope of the new Transactions on Games simply reads:

The IEEE Transactions On Games publishes original high-quality articles covering scientific, technical, and engineering aspects of games.

This means that research on artificial intelligence for games, and games for artificial intelligence, is very welcome, just as it is was in TCIAIG. But ToG will also be accepting papers on human-computer interaction, graphics, educational and serious games, software engineering in games, virtual and augmented reality, and other topics.The scope specifically indicates "scientific, technical engineering aspects of games", and I expect that the vast majority of what is published will be empirial and/or quantitative in nature. In other words, game studies work belonging primarily in the humanities will be outside the scope of the new journal. The same goes for work that has nothing to do with games, for example, game theory applied to non-game domains. (While there is some excellent work on game theory applied to games, much game theory research has nothing to do with games that anyone would play.) Of course, acceptance/rejection decisions will be taken based on the recommendations of Associate Editors, who act on the recommendations of reviewers, leaving some room for interpretation of the exact boundaries of what type of research the journal will publish.

Already before I take over as Editor-in-Chief, I am working together with Graham to refresh the editorial board of the journal. I expect to keep many of the existing TCIAIG associate editors, but will need to replace some, and in particular add more associate editors with knowledge of the new topics where the journal will publish papers, and visibility in those research communities. I will also be working on reaching out to these research communities in various ways, to encourage researchers there to submit their best work the IEEE Transactions on Games.

Given that I will still be teaching, researching and leading a research group at NYU, I will need to cut down on some other obligations to free up time and energy for the journal. As a result, I will be very restrictive when it comes to accepting reviewing tasks and conference committee memberships in the near- to mid-term future. So if I turn down your review request, don't take it personally.

Needless to say, I am very excited about taking on this responsibility and work on making ToG the journal of choice for anyone doing technical, engineering or scientific research related to games.

Sunday, July 23, 2017

Some advice for journalists writing about artificial intelligence

Dear Journalists,

I'd like to offer some advice on how to write better and more truthfully when you write articles about artificial intelligence. The reason I'm writing this is that there are a whole lot of very bad articles on AI (news articles and public interest articles) being published in newspapers and magazines. Some of them are utter nonsense, bordering on misinformation, some of them capture the gist of what goes on but are riddled with misunderstandings. No, I will not provide examples, but anyone working in AI and following the news can provide plenty. There are of course also many good articles about AI, but the good/bad ratio could certainly be improved.

First off, I understand. You're writing about an extremely fast-moving field full of jargon and enthusiastic people with grand visions. Given all this excitement, there must be plenty to write about, but you don't know much (or even anything) about the field. You probably know as little about AI as I know about, say, tannery. But where tannery evolves only very slowly and involves very concrete materials and mechanics, AI moves at breakneck speed and few of those words that get thrown around seem to refer to anything you can touch or see. There's a feeling that you need to write about the latest developments NOW before they are superseded, but it's hard to see where to even begin to decipher the strange things those AI researchers say. And of course you want to write something readable, and clickable, and you don't have much time. It can't be easy.

So here's a few things to keep in mind, and some concrete recommendations, for more critical and higher-quality reporting on AI. Some of this is based on my experience with being interviewed by journalists of varying technical proficiency, and with varying inclination to buy the story I was trying to sell them. Yes, we're all trying to sell something, even we curmudgeons in the ivory tower are trying to sell you something. More about this below.

Keep in mind: AI is a big field, and very diverse in terms of topics and methods used. (True, it's not as diverse as it should be in some other senses.) The main AI conferences (such as IJCAI, AAAI, ICML and NIPS) have thousands of attendees, and most of them only understand a small part of what goes on in the conference. When I go to one of these conferences, I can perhaps follow maybe 20% of the talks and get something out of them. While I might be a bit dim myself, it's rare to find anyone who can keep up to date with sub-fields as diverse as constraint propagation, deep learning and stochastic search.

Recommendation: Do not assume that researchers you talk to knows "what's going on right now in AI". Even more importantly, if someone says they know what's going on right now in AI, assume that they only know a small part of the big picture. Double-check with someone working in another field of AI.

Keep in mind: There is no such thing as "an artificial intelligence". AI is a collection of methods and ideas for building software that can do some of the things that humans can do with their brains. Researchers and developers develop new AI methods (and use existing AI methods) to build software (and sometimes also hardware) that can do something impressive, such as playing a game or drawing pictures of cats. However, you can safely assume that the same system cannot both play games and draw pictures of cats. In fact, no AI-based system that I've ever heard of can do more than a few different tasks. Even when the same researchers develop systems for different tasks based on the same idea, they will build different software systems. When journalists write that "Company X's AI could already drive a car, but it can now also write a poem", they obscure the fact that these are different systems and make it seem like there are machines with general intelligence out there. There are not.

Recommendation: Don't use the term "an AI" or "an artificial intelligence". Always ask what the limitations of a system is. Ask if it really is the same neural network that can play both Space Invaders and Montezuma's Revenge (hint: it isn't).

Keep in mind: AI is an old field, and few ideas are truly new. The current, awesome but a tad over-hyped, advances in deep learning have their roots in neural network research from the 1980s and 1990s, and that research in turn was based on ideas and experiments from all the way back in the 1940s. In many cases, cutting edge research consists of minor variations and improvements on methods that were devised before the researchers doing these advances were born. Backpropagation, the algorithm powering most of today's deep learning, is several decades old and was invented independently by multiple individuals. When IBM's Deep Blue computer won over Garry Kasparov and showed that computers could play Chess better than humans, the very core of the software was the Minimax algorithm, first implemented by Alan Turing in the 1940s. Turing, one of the fathers of both artificial intelligence and the wider field of computer science, also wrote the paper "On Computing Machinery and Intelligence" which was published in 1950. While that paper is most famous for introducing what is now called the Turing Test, it also contains the seeds of many of the key ideas in artificial intelligence.

Recommendations: Read Turing's 1950 paper. It's surprisingly easy and enjoyable to read, free from mathematical notation, and any technical terms can easily be glossed over. Marvel at how many of the key ideas of artificial intelligence were already in place, if only in embryonic form. When writing stories about exciting new developments, also consult an AI researcher that is old, or at least middle aged. Someone who was doing AI research before it was cool, or perhaps even before it was uncool, and so has seen a full cycle of AI hype. Chances are that person can tell you about which old idea this new advance is a (slight?) improvement on.

Keep in mind: Researchers always have something to sell. Obviously, those working in some kind of startup are looking to increase the valuation of their company and their chances of investment or acquisition. Those working in academia are looking for talk invitations, citations, promotions and so on. Those working in a large company will want to create interest in some product which might or might not be related to their actual results.

Recommendations: Don't believe the hype. Approach another researcher, who the people you're writing about did not forward you to, and ask if that person believes their claims.

Keep in mind: Much of "artificial intelligence" is actually human ingenuity. There's a reason why researchers and developers specialize in applications of AI to specific domains, such as robotics, games or translation: when building a system to solve a problem, lots of knowledge about the actual problem ("domain knowledge") is included in the system. This might take the role of providing special inputs to the system, using specially prepared training data, hand-coding parts of the system or even reformulating the problem so as to make it easier.

Recommendation: A good way of understanding which part of an "AI solution" are automatic and which are due to niftily encoded human domain knowledge is to ask how this system would work on a slightly different problem.

I'd better stop writing here, as this text probably already sounds far too grumpy. Look, I'm not grumpy, I'm barely even old. And I don't want to give the impression that there isn't a lot of exciting progress in AI these days. In fact, there are enough genuine advances to report on that we don't need to pad out the reporting with derivate research that's being sold as new. Let's all try to be honest, critical and accurate, shall we?

Friday, November 04, 2016

How Darwin plays StarCraft

StarCraft is perhaps the single hardest game for computers to play well. At least if you only count games that people care about; you could of course construct games that were harder, but there's no guarantee anyone would play those games. When doing AI research, working on games that people care about means you are working on relevant problems. This is because games are designed to challenge the human brain and successful games are typically good at this. StarCraft (and its successor StarCraft 2) are played and loved by millions of people all over the world, with a very active competition scene where pro players are well-paid stars.

And there's no question that the game is hard; there is a series of AI StarCraft competitions that has been running since 2010, but the best AI players are still at the level of human novices. In other words, roughly where the best AI Go players were 15 years ago, or the best AI Chess players were 50 years ago. As computers are now able to play Chess and Go better than the best humans, the question is when we can surpass human ability for StarCraft as well.

It's not just me thinking this. Google DeepMind recently announced that StarCraft 2 will be one of their major new testbeds, after their success at training deep networks to play Atari games in the ALE framework. Facebook AI Research recently published their first paper on using machine learning to learn to play StarCraft and just today submitted another, showing that they take this challenge seriously. In academia, there is already a rich body of work on algorithms for playing (parts of) StarCraft, or generating maps for it. Given the game's complexity, it is unlikely we will conquer all of it soon; we have our work cut out for us.

A screenshot from the original StarCraft game

One of the reasons the game is so hard is that playing it well requires thinking and acting on different levels of abstraction. The game requires resource collection management, build order scheduling, prioritizing technology development, exploration, micro-management of troops as well as overall strategy and ways of deducing and countering the adversary's strategy. Trying to build an AI that can do all this well is very very hard. It is therefore prudent to approach the various parts of the problem separately.

In a new paper, we propose a new algorithm for playing StarCraft micro, given a forward model. "Micro" is the second-to-second, sometimes frame-to-frame, business of managing armies of StarCraft units in combat. The difficulty of playing micro is the reason professional (human) StarCraft players often average several hundred mouse-clicks per minute. To an unprepared onlooker good micro play tends to look chaotic, while it is in reality a highly complex affair with certain maneuvers requiring extreme skill.

A StarCraft battle with no micro tactics.
The green troops to the left don't move at all, and lose the battle.

The same battle with active micro tactics.
By moving around units depending on their health and cooldown level, a much better results is achieved.

So, what AI methods can we use to play StarCraft micro? There have been a number of attempts to use various forms of tree search including Monte Carlo Tree Search (MCTS), a core component in AlphaGo, the software that recently beat Lee Sedol to become world champion at the game of Go. The problem with using tree search to play StarCraft is the extremely high branching factor, meaning the extremely high number of possible actions that could be taken at any time. Where Chess has an average branching factor of around 35, and Go has an average branching factor of about 300, StarCraft micro often reaches branching factors of millions. This is because you don't just move one piece, you often have 10 to 50 different units to control at the same time. And the number of possible actions increases exponentially with the number of units that can act at the same time. Complex indeed.

Standard tree search algorithms, including MCTS, perform very badly when faced with such enormous numbers of actions to choose from. Basically, there are so many actions to consider that they run out of time before even considering a single step forward. So we need to approach this problem some other way. In some work presented earlier this year, and which concerned another strategy game, we attempted to use evolutionary computation instead of tree search to play the game. This worked very well - I wrote a separate blog post about that work.

Portfolio Online Evolution (2 scripts) in the JarCraft Simulator versus script-based UCT

The basic idea is to run an evolutionary algorithm every time step to select what to do next. Each "chromosome" (or "solution" or "individual") is a set of actions - one or more actions for each unit in the game. All the chromosomes are then scored based on how good results they achieve in simulation, and then the good chrosomes are kept, the less good ones thrown away and replaced with mutated copies of the good ones, again and again. Essentially Darwinian evolution in a computer program. Well, actually it's a bit more complicated, but that's the gist of it. We call this method Online Evolution, because it uses evolution, but not to tune a controller ("offline") as is often done, instead evolution is used as an action selection mechanism ("online").

For StarCraft we wanted to combine this very effective method with a way of incorporating domain knowledge about StarCraft playing. Fortunately, Dave Churchill at Memorial University of Newfoundland had already come up with a clever idea here, in the form of Portfolio Greedy Search. The core idea here is to not select directly among the different actions (for example move to a particular place, or attack a particular enemy). Instead, his algorithm uses a number of existing "scripts", which are simple rules for what units should do in different situations. Churchill's method uses a simple greedy search algorithm to search for what script to use to control each unit each time step.

Which finally brings us to the new algorithm we introduce in our paper: Portfolio Online Evolution. As the name suggests, this is a combination of the ideas of Online Evolution and Portfolio Greedy Search. You might already have figured this out by now, but what it does is to evolve plans for what script each unit should use each time step. Each chromosome contains a sequence of scripts for each unit, and they are evaluated by simulating a number of steps forward in simulation and seeing the results of using this sequence of scripts. (Quite simply, we use the difference in total hitpoints between our team and the adversary as the score.)

Portfolio Online Evolution (6 scripts) in the JarCraft Simulator versus script-based UCT

So does Portfolio Online Evolution work in StarCraft? Hell yes! It kicks ass. Protoss ass. We tested the algorithm using the JarCraft simulator, which is very convenient to work in as the real StarCraft game itself lacks a forward model. JarCraft comes with a several tree search methods implemented. It turns out Portfolio Online Evolution beats all of them soundly. What's more, its margin of victory gets bigger the larger the battle size (number of units on each side), and the number of scripts supplied to the algorithm. We were of course very happy with this result.

So where did this leave us? We can't play a full StarCraft yet, can we? No, because the full StarCraft game does not have a forward model, meaning that it cannot be simulated much faster than real-time. Portfolio Online Evolution, like other methods that search the space of future game states, require a fast forward model. It seems that we will have to concentrate on creating methods for learning such forward models in games such as StarCraft, to allow effective AI methods to be used.

In the nearer term, one of our ongoing projects is to learn the scripts themselves, to expand the repertoire of scripts available for the evolutionary script selection mechanism to choose from.

Finally, a note on who did what: When I say "we", I mean mostly a team of students led by Che "Watcher" Wang from NYU Shanghai. The other participants were Pan Chen and Yuanda Li, and the work was supervised by Christoffer Holmgård and myself. The project started as a course project in my course on AI for Games, and Watcher then wrote most of the paper. The paper was presented at the AIIDE conference a few weeks ago.

Tuesday, November 01, 2016

Overcoming the limits of pre-AI designs

In a VentureBeat article I argue that most fundamental game designs were made at a point in time were many AI algorithms were not invented, and the computers of the time were too weak to run those algorithms that existed. Therefore games were designed not to need AI.

Nowadays, we have much more sophisticated AI methods and better computers to run them. However, game design has not kept up. Our games are still designed not to need AI, probably because our game designs are evolutions of the same old designs. This is why many game developers argue that their games don't need AI.

We need to go beyond this, and redesign games with AI capabilities in mind. Perhaps design the games around the AI. There are some more examples from academia here.

Of course, this argument applies to many other things than games. Perhaps most things we do.

Thursday, September 29, 2016

How to organize CIG

When you run an annual conference series, it is important to maintain continuity and make sure that best practices live on from year to year. For many conferences, this seems to happen organically and/or imperfectly. For the IEEE Conference on Computational Intelligence and Games, we have a Games Technical Committee to oversee the conference, and a habit of always keeping a few people from previous years' conference organizing committee on the next year's committee. Now, we also have a set of of organizers' guidelines.

I took the initiative to formalize the rules we have informally and gradually agreed on for the conference back in 2014. I wrote a first draft, and then circulated it to all previous chairs of the conference, receiving much useful feedback. A number of people provided useful feedback, additions and/or edits of the text; among those who contributed substantially are Phil Hingston, Simon Lucas and Antonio Fernández Leiva (there are probably more, but I can't find them in the mail chain).

The complete guidelines can be found here, and also here. Please note that this is nothing like a final document with rules set in stone (and who would be to have the authority to do that anyway?). Rather, it's a starting point for future discussions about rules and practices. Our idea is that it can be very useful for future CIG organizers to have the way the conference has been organized written down in a single place. It could also be useful for people organizing other conferences, inside and outside AI and games.

While we're at it, I'd like to point out that I've also written about how to organize game-based AI competitions. This could be a useful resource for anyone who's into organizing competitions.

Tuesday, August 09, 2016

Algorithms that select which algorithm should play which game

This was meant to be a short post announcing some new papers of ours, but it grew. So what you get is a post announcing some new papers, with plenty of context.

Video games are possibly the best way of testing artificial intelligence algorithms. At least I've argued this (at some length) before. One of the things that video games offer that for example robot problems don't offer, is to easily and quickly test the same algorithm on a large number of games. That is one of the guiding principles of General Video Game AI Competition (GVGAI), which has certain advantages over other game-based AI testbeds.

Some games implemented in General Video Game AI Framework.

Now in practice we don't yet have any algorithm that can play all different video games well. (If we had, we would pretty much have solved AI. Really. Remember that "video games" is a very broad category, including Sokoban and Zork as well as Witcher 3 and Ikaruga. The GVGAI competition is more limited, but still covers a broad range of games.) For some games, we have no algorithms that can play the game well at all. For other games, we have algorithms that can play those games at human level, or sometimes even better. But an algorithm which can play Space Invaders well might not be very good at Boulder Dash, and suck completely at all a puzzle game such as A Good Snowman is Hard to Build. Conversely, a good snowman-building AI might not be very adept at fending off invading space aliens. We like to say that game-playing performance is intransitive.

But what we want is a single algorithm that can play any game well. Because intelligence is not about being able to perform a single task, but to perform all (or most) tasks (from within some reasonably general domain) well. Given that what we have are different algorithms that can each only perform some minority of tasks well, how do we get there?

One answer is to build an algorithm that includes a number of different game-playing algorithms, the "champions" of each type of game. When faced with new game, it would choose one of its constituent algorithms to play the game. If for each game the algorithm selects the best sub-algorithm to play that game, it would perform much better than any one algorithm on its own. Ex pluribus unum, stronger together, the whole is greater than the sum of its parts and all that. The question then is of course how to select the right algorithm for each game.

How well a number of different algorithms play different games in the GVGAI framework. Lighter=better. Several different categories or clusters of games appear based on the performance of the algorithms.

Luckily, people have studied the problem of how to automatically select algorithms to solve problems for some time under the names algorithm selection (obviously) and hyper-heuristics (less obviously). Various methods for selecting which algorithms (or heuristics, which means the same in this context) have been explored. But not for video games. Only for more abstract and... I can't say boring, can I? Whatever. Hyper-heuristics and algorithm selection techniques have mostly been applied to more abstract (and perhaps boring) problems such as satisfiability and combinatorial optimization. These are important problems for sure, but they are not all there is. In terms of problems that map to relevant challenges for human intelligence, general video game playing is arguably more important.

My coworkers and I therefore set out investigate how hyper-heuristics can be applied to video games, more specifically video games in the GVGAI framework. Our first result is a paper published in the IEEE Conference on Computational Intelligence and Games this September - read it here. We show that using simple features easily available to the agent, we can predict which of several algorithms will play an unseen game best with relatively high accuracy. This allows us to build an agent based on a handful of those agents that did best in the 2014 and 2015 GVGAI competitions, and verify that by cleverly choosing which one of these to use for each new game it can outperform all of them on games that it has not been trained on. The classifier at the heart of this agent is a simple decision tree, which also allows us to inspect on what basis it selects certain agents over others.

A decision tree used in one of our hyper-heuristic agents. Based on features of the game, it decides which algorithm to use to play it.

However, we felt that these results were not enough - ideally, the agent should select the best-performing algorithm for a new game every time, not just most of the time. Maybe we could find better features to feed our algorithm selection algorithm, or maybe train it differently?

In our next paper, which will be presented at the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, we set out to investigate this question in more depth. Read the paper here. To begin with, we tested the performance of a very broad range of algorithms on all known GVGAI games. We identified a number of groups of games where particular algorithms do well. We also investigated a broader set of features for selecting algorithms, hoping to find better ways of identifying classes of games that would be susceptible to specific types of game-playing algorithms. One idea we had was that those classes would somehow correspond to game genres as identified by humans (e.g. some algorithms would be better at shooters, others at puzzle games etc). While we still don't have enough features to reliably classify games into genres as seen by humans, some results support this notion. In particular we found that we could divide games into four different classes, where different algorithms play well, by looking only at two features: whether the avatar in the game can shoot, and whether it can die.

This decision tree looks only at whether the avatar can shoot and whether it can die.

Of course, there's a lot of research left to do here. We still need better algorithm selection methods, and trying to (automatically?) construct features is a promising direction here. We could also probably do better if we learn to switch algorithms during gameplay - maybe some algorithm is more useful in the initial phases of a game, and another algorithm is more useful for the endgame? And let's not forget that we have many games which no existing algorithm plays well.

Sunday, July 17, 2016

Which games are useful for testing artificial general intelligence?

It is very hard to make progress on artificial intelligence without having a good AI problem to work on. And it is impossible to verify that your software is intelligent without testing it on a relevant problem. For those who work on artificial general intelligence, the attempt to make AI that is generally intelligent as opposed to "narrow AI" for specific tasks, it is crucial to have reliable and accurate benchmarks of general intelligence.

I have previously written about why games are ideal as intelligence tests for AI. Here'd I like to go into more depth about what sort of games we would like to use to test AI, specifically AGI (artificial general intelligence). These are the properties I think games used to test AGI should have:

  • They should be good games. Well-designed games are more entertaining and/or immersive because they challenge our brains better; according to converging theories from game design, developmental psychology and machine learning the fun in playing largely comes from learning the game while playing. A game that is well-designed for humans is therefore probably a better AI benchmark.
  • They should challenge a broad range of cognitive skills. Classical board games largely focus on a rather narrow set of reasoning and planning skills. Video games can challenge a broader set of cognitive skills, including not only reasoning and planning but also e.g. perception, timing, coordination, attention, and even language and social skills.
  • * Most importantly, they should not be one game. Developing AI for a single game has limited value for general AI, as it is very easy to "overfit" your solution to particular game by implementing various domain-specific solutions (or, as they're usually called, hacks). In the past, we've seen this development over and over with AI developed for particular games (though occasionally something of great general value appears out of research on a particular game, such as the Monte Carlo Tree Search (MCTS) algorithm being invented to play Go). Therefore it is important that AI agents are tested on many different games as part of the same benchmark. Preferably these would games that the AI developer does not even know about when developing the AI.

So let's look at how the main game-based AI benchmarks stack up against these criteria.

To begin with, there are a number of game-based AI benchmarks based on individual games. A pretty big number, in fact. The annual IEEE Conference on Computational Intelligence and Games hosts a number of game-based AI competitions, where the software can also be used offline as a benchmark. And of course, classic board games such as Chess, Checkers and Go have long been used as AI benchmarks. An interesting recent addition is Microsoft's Project Malmo, which uses Minecraft as the base for an AI sandbox/benchmark.

But these are all focused on individual games, and therefore not well suited to benchmark general intelligence. Let's talk about general game playing frameworks.

General Game Playing Competition

First we have the General Game Playing Competition and its associated software. This competition has been going since 2005, initiated by Michael Genesereth. For the competition, a game description language was developed for encoding the games; this language is a logic programming language similar to Prolog, and allows the definition of in theory any turn-based game with discrete world state. (Initially these games could not have any hidden information, but that restriction has since been overcome with new versions of the language.) In practice, almost all games defined in this language are fairly simple in scope, and could very broadly be described as board games.

To compete in the General Game Playing competition, you submit an agent that can play any game defined in this language. The agents have access to the full game description, and typically a large part of the agent development goes into analyzing the game to find useful ways of playing it. The actual game-playing typically uses MCTS or some closely related algorithm. New games (or variations of old games) are used for each competition, so that competitors cannot tune their AIs to specific games. However, the complexity of developing games in the very verbose game description language limits the number and novelty of these games.

Arcade Learning Environment

The second entry in our list is the Arcade Learning Environment (ALE). This is a framework built on an emulator of the classic Atari 2600 game console from 1977 (though there are plans to include emulation of other platforms in the future). Marc Bellemare and Michael Bowling developed the first version of this framework in 2012, but opted to not organize a competition based on it. Agents can interface to the ALE framework directly and play any of several dozen games; in principle, any of the several hundred released Atari 2600 games can be adapted to work with the framework. Agents are only given a raw image feed for input, plus the score of the game. To play a game in the ALE framework, your agent therefore has to decipher the screen in some way to find out what all the colorful pixels mean.

Three different games in the ALE framework.

Most famously, the ALE framework was used in Google DeepMind's Nature paper from last year where they showed that they could train convolutional deep networks to play many of the classic Atari games. Based only on rewards (score and winning/losing) these neural networks taught themselves to play games as complex as Breakout and Space Invaders. This was undeniably an impressive feat. Figuring out what action to take based only on the screen input is far from a trivial transform, and the analogue to the human visuomotor loop suggests itself. However, each neural network was trained using more than a month of game time, which is clearly more than would be expected of e.g. a human learner to learn to play a single game. It should also be pointed out that the Atari 2600 is a simple machine with only 128 bytes of RAM, typically 2 kilobytes of ROM per game and no random number generator (because it has no system clock). Why does it take so long time to learn to play such simple games?

Also note that the networks trained for each of these games was only capable of playing the specific game it was trained on. To play another game, a new network needs to be trained. In other words, we are not talking about general intelligence here, more like a way of easily creating task-specific narrow AI. Unfortunately the ALE benchmark is mostly used in this way; researchers train on a specific game and test their trained AI's performance on the same game, instead of on some other game. Overfitting, in machine learning terms. As only a fixed number of games are available (and developing new games for the Atari 2600 is anything but a walk in the park) it is very hard to counter this by enforcing that researchers test their agents on new games.

General Video Game AI Competition

Which brings us to the third and final entry on my list, the General Video Game AI Competition (GVGAI) and its associated software. Let me start by admitting that I am biased when discussing GVGAI. I was part of the group of researchers that defined the structure of the Video Game Description Language (VGDL) that is used in the competition, and I'm also part of the steering committee for the competition. After the original concepts were defined at a Dagstuhl meeting in 2012, the actual implementation of the language and software was done first by Tom Schaul and then mostly by Diego Perez. The actual competition ran for the first year in 2014. A team centered at the University of Essex (but also including members of my group at NYU) now contributes to the software, game library and competition organization.

The "Freeway" game in the GVGAI framework.

The basic idea of GVGAI is that agents should always be tested on games they were not developed for. Therefore we develop ten new games each time the competition is run; we currently have a set of 60 public games, and after every competition we release ten new games into the public set. Most of these games are similar to (or directly based on) early eighties-style arcade games, though some are puzzle games and some have more similarities to modern 2D indie games.

In contrast to ALE, an agent developed for the GVGAI framework gets access to the game state in a nicely parsed format, so that it does not need to spend resources understanding the screen capture. It also gets access to a simulator, so it can explore the future consequences of each move. However, in contrast to both ALE and GGP, agents do not currently get any preparation time, but needs to start playing new games immediately. In contrast to GGP, GVGAI bots do also not currently get access to the actual game description - they must explore the dynamics of the game by attempting to play it. This setup advantages different agents than the ALE framework. While the best ALE-playing agents are based on neural networks, the best GVGAI agents tend to be based on MCTS and similar statistical tree search approaches.

The game "Run" in GVGAI.

The GVGAI competition and framework is very much under active development, and in addition to the planning track of the competition (with the rules described above), there is now a two-player track and a learning track is in the works, where agents get time to adapt to a particular game. We also just ran the level generation track for the first time, where competitors submit level generators rather than game-playing agents, and more tracks are being discussed. Eventually, we want to be able to automatically generate new games for the framework, but this research yet has some way to go.

To sum up, is any of these three frameworks a useful benchmark for artificial general intelligence? Well, let us acknowledge the limitations first. None of them test things skills such as natural language understanding, story comprehension, emotional reasoning etc. However, for the skills they test, I think they each offer something unique. GGP is squarely focused on logical reasoning and planning in a somewhat limited game domain. ALE focuses on perception and to some extent planning in a very different game domain, and benefits from using the original video games developed for human players. I would like to argue that GVGAI tests the broadest range of cognitive skills through having the broadest range of different games, and also the best way of preventing overfitting through the simplicity of creating new games for the framework. But you should maybe take this statement with a pinch of salt as I am clearly biased, being heavily involved in the GVGAI competition. In any case, I think it is fair to say that using any of these frameworks clearly beats working on a single game if you are interested in making progress on AI in general, as opposed to a solution for a particular problem. (But by all means, go on working on the individual games as well - it's a lot of fun.)