Sunday, May 27, 2018

Empiricism and the limits of gradient descent

This post is actually about artificial intelligence, and argues a position that many AI researchers will disagree with. Specifically, it argues that the method underlying most of deep learning has severe limitations which another, much less popular method can overcome. But let's start with talking about epistemology, the branch of philosophy which is concerned with how we know things. Then we'll get back to AI.

Be warned: this post contains serious simplifications of complex philosophical concepts and arguments. If you are a philosopher, please do not kill me for this. Even if you are not a philosopher, just hear me out, OK?

In the empiricist tradition in epistemology, we get knowledge from the senses. In the 17th century, John Locke postulated that the mind is like a blank slate, and the only way which we can get knowledge is through sense impressions: these impressions figuratively write our experience onto this blank slate. In other words, what we perceive through our eyes, ears and other sense organs causes knowledge to be formed and accumulated within us.

The empiricist tradition of thought has been very influential for the last few centuries, and philosophers such as Hume, Mill and Berkeley contributed to the development of empiricist epistemology. These thinkers shared the conviction that knowledge comes to us through experiencing the world outside of us through our sense. They differed in what they thought we can directly experience - for example, Hume though we can not experience causality directly, only sequences of world-states - and exactly how the sense impressions create knowledge, but they agree that the sense impressions are what creates knowledge.

In the 20th century, many philosophers wanted to explain how the (natural) sciences could be so successful, and what set the scientific mode of acquiring knowledge apart from superstition. Many of them were empiricists. In particular, the Vienna Circle, a group of philosophers, mathematicians, and physicists inspired by the early work of Wittgenstein, articulated a philosophy that came to be known as Logical Empiricism. The basic idea is that sense impressions is all there is, and that all meaningful statements are complex expressions that can be analyzed down to their constituent statements about sense impressions. We gain knowledge through a process known as induction, where we generalize from our sense impressions. For example, after seeing a number of swans that are white you can induce that swans are white.

A philosopher that was peripheral to the Vienna Circle but later became a major figure in epistemology in his own right was Karl Popper. Popper shared the logical empiricists' zeal for explaining how scientific knowledge was produced, but differed radically in where he thought knowledge came from. According to Popper, facts do not come from sense impressions. Instead, they come "from within": we formulate hypotheses, meaning educated guesses, about the world. These hypotheses are then tested against our sense impressions. So, if we hypothesize that swans are white, we can then check this with what our eyes tell us. Importantly, we should try to falsify our hypotheses, not to verify them. If the hypothesis is that swans are white, we should go looking for black swans, because finding one would falsify our hypothesis. This can be easily motivated with that if we already think swans are white, we're not getting much new information by seeing lots of white swans, but seeing a black swan (or trying hard but failing to find a black swan) would give us more new information.

Popper called his school of thought "critical rationalism". This connects to the long tradition of rationalist epistemology, which just like empiricist epistemology has been around for most of the history of philosophy.  For example, Descartes' "I think, therefore I am" is a prime example of knowledge which does not originate in the senses.

Among (natural) scientists with a philosophical bent, Popper is extremely popular. Few modern scientists would describe themselves as logical empiricists, but many would describe themselves as critical rationalists. The main reason for this is that Popper describes ways of successfully creating scientific knowledge, and the logical empiricists do not. To start with the simple case, if you want to arrive at the truth about the color of swans, induction is never going to get you there. You can look at 999999 white swans and conclude that they are all white, but the millionth may be black. So there can be no certainty. With Popper's hypothetico-deductive method you'd make a hypothesis about the whiteness of swans, and then go out and actively try to find non-white swans. There's never any claim of certainty, just of an hypothesis having survived many tests.

More importantly, though, the logical empiricist story suffers from the problem that more complex facts are simply not in the data. F=ma and E=mc2 are not in the data. However many times you measure forces, masses and accelerations of things, the idea that the force equals mass times acceleration is not going to simply present itself. The theories that are at the core of our knowledge cannot be discovered in the data. They have to be invented, and then tested against the data. And this is not confined to large, world-changing theories.

If I already have the concepts of swan, white and black at the ready, I can use induction to arrive at the idea that all swans are white. But first I need to invent these concepts. I need to decide that there is such a thing as a swan. Inductivists such as Hume would argue that this could happen through observing that "a bundle of sense impressions" tend to co-occur whenever we see a swan. But a concept such a swan is actually a theory: that the animal is the same whether it's walking of flying, that it doesn't radically change its shape or color, and so on. This theory needs to somehow be invented, and then tested against observation.

In other words, empiricism is at best a very partial account of how we get knowledge. On its own, it can't explain how we arrive at complex concepts or theories, and it does not deliver certainty. Perhaps most importantly, the way we humans actually do science (and other kinds of advanced knowledge production) is much more like critical rationalism than like empiricism. We come up with theories, and we work to confirm of falsify them. Few scientists just sit around and observe all day.

Enough about epistemology for now. I promised you I would talk about artificial intelligence, and now I will.

Underlying most work in neural networks and deep learning (the two terms are currently more or less synonymous) is the idea of stochastic gradient descent, in particular as implemented in the backpropagation algorithm. The basic idea is that you can learn to map inputs to outputs through feeding the inputs to the network, seeing what comes out at the other hand, and compare it with the correct answer. You then adjust all the connection weights in the neural network so as to bring the output closer to the correct output. This process, which has to be done over and over again, can be seen as descending the error gradient, thus the name gradient descent. You can also think of this as the reward signal pushing around the model, repelling it whenever it does something bad.

(How do you know the correct output? In supervised learning, you have a training set with lots of inputs (e.g. pictures of faces) and corresponding outputs (e.g. the names of the people in the pictures). In reinforcement learning it is more complex, as the input is what an agent sees of the world, and the "correct" output is typically some combination of the actual reward the agent gets and the model's own estimate of the reward.)

Another type of learning algorithm that can be used for both supervised learning and reinforcement learning (and many other things as well) is evolutionary algorithms. This is a family of algorithms based on mimicking Darwinian evolution by natural selection; algorithms in this family include evolution strategies and genetic algorithms. When using evolution to train a neural net, you keep a population of different neural nets and test them on whatever task they are supposed to perform, such as recognizing faces or playing a game. Every generation, you throw out the worst-performing nets, and replace them with "offspring" of the better-performing neural nets; essentially, you make copies and combinations of the better nets and apply small perturbations ("mutations") to them. Eventually, these networks learn to perform their tasks well.

So we have two types of algorithms that can both be used for performing both supervised learning and reinforcement learning (among other things). How do they measure up?

To begin with, some people wonder how evolutionary algorithms could work at all. It is perhaps important to point out here that evolutionary algorithms are not random search. While randomness is used to create new individuals (models) from old ones, fitness-based selection is necessary for these algorithms to work. Even a simple evolution strategy, which can be implemented in ten or so lines of code, can solve many problems well. Additionally, decades of development of the core idea of evolution as a learning and search strategy has resulted in many more sophisticated algorithms, including algorithms that base the generation of new models on adaptive models of the search space, algorithms that handle multiple objectives, and algorithms that find diverse sets of solutions.

Gradient descent is currently much more popular than evolution in the machine learning community. In fact, many machine learning researchers do not even take evolutionary algorithms seriously. The main reason for this is probably the widespread belief that evolutionary algorithms are very inefficient compared to gradient descent. This is because evolutionary algorithms seem to make use of less information than gradient descent does. Instead of incorporating feedback every time a reward is found in a reinforcement learning problem, in a typical evolutionary algorithm only the end result of an episode is taken into. For example, when learning to play Super Mario Bros, you could easily tell a gradient descent-based algorithm (such as Q-learning) to update every time Mario picks up a coin or gets hurt, whereas with an evolutionary algorithm you would usually just look at how far Mario got along the level and use that as feedback.

Another way in which evolution uses less information than gradient descent is that the changes to the network are not necessarily done so as to minimize the error, or in general to make the network as good as possible. Instead, the changes are generally completely random. This strikes many as terribly wasteful. If you have a gradient, why not use it?

(Additionally, some people seem to dislike evolutionary computation because it is too simple and mathematically uninteresting. It is true that you can't prove many useful theorems about evolutionary algorithms. But come on, that's not a serious argument against evolutionary algorithms, more like a prejudice.)

So is the idea that evolutionary algorithms learn less efficiently than gradient descent supported by empirical evidence? Yes and maybe. There is no question that the most impressive results coming out of deep learning research are all built on gradient descent. And for supervised learning, I have not seen any evidence that evolution achieves anything like the same sample-efficiency as gradient descent. In reinforcement learning, most of the high-profile results rely on gradient descent, but they also rely on enormous computational resources. For some reinforcement learning problems which can be solved with small networks, evolutionary algorithms perform much better than any gradient descent-based methods. They also perform surprisingly well on playing Atari games from high-dimensional visual input (which requires large, deep networks) and are the state of the art on the MuJoCo simulated robot control task.

Does evolutionary algorithms have any advantage over gradient descent? Yes. To begin with, you can use them even in cases where you cannot calculate a gradient, i.e. your error function is not differentiable. You cannot directly learn program code or graph structures with gradient descent (though there are indirect ways of doing it) but that's easy for evolutionary algorithms. However, that's not the angle I wanted to take here. Instead I wanted to reconnect to the discussion of epistemology this post started with.

Here's my claim: learning by gradient descent is an implementation of empiricist induction, whereas evolutionary computation is much closer to the hypothetico-deductive process of Popper's critical rationalism. Therefore, learning by gradient descent suffers from the same kind of limitations as the empiricist view of knowledge acquisition does, and there are things that evolutionary computation can learn but gradient descent probably cannot.

How are those philosophical concepts similar to these algorithms? In gradient descent, you are performing frequent updates in the direction that minimizes error. The error signal can be seen as causal: when there is an error, that error causes the model to change in a particular way. This is the same process as when a new observation causes a change in a person's belief ("writing our experience on the blank slate of the mind"), within the empiricist model of induction. These updates are frequent, making sure that every signal has a distinct impression on the model (batch learning is often used with gradient descent, but generally seen as a necessary evil). In contrast, in evolutionary computation, the change in the model is not directly caused by the error signal. The change is stochastic, not directly dependent on the error and not in general in the direction that minimizes the error, and in general much less common. Thus the model can be seen as a hypothesis, which is tested through applying the fitness function. Models are generated not from the data, but from previous hypotheses and random changes; they are evaluated by testing their consequences using the fitness function. If they are good, they stay in the population and more hypotheses are generated from them; if they are bad, they die.

How about explicitly trying to falsify the hypothesis? This is a key part of the Popperian mode of knowledge acquisition, but it does not seem to be part of evolutionary computation per se. However, it is part of competitive coevolution. In competitive coevolution, two or more populations are kept, and the fitness of the individuals in one population are dependent on how well they are perform against individuals in the other population. For example, one population could contain predators and the other prey, or one could contain image generators and the other image recognizers. As far as I know, the first successful example of competitive coevolution was demonstrated in 1990; the core idea was later re-invented (though with gradient descent instead of evolutionary search) in 2014 as generative adversarial networks.

If you accept the idea that learning by gradient descent is fundamentally a form of induction as described by empiricists, and that evolutionary computation is fundamentally more like the hypothetico-deductive process of Popperian critical rationalism, where does this take us? Does it say anything about what these types of algorithms can and cannot do?

I believe so. I think that certain things are extremely unlikely to ever be learned by gradient descent. To take an obvious example, I have a hard time seeing gradient descent ever learning F=ma or E=mc2. It's just not in the data - it has to be invented. And before you reply that you have a hard time seeing how evolution could learn such a complex law, note that using evolutionary computation to discover natural laws of a similar complexity has been demonstrated almost a decade ago. In this case, the representation (mathematical expressions represented as trees) is distinctly non-differentiable, so could not even in principle be learned through gradient descent. I also think that evolutionary algorithms, working by fewer and bolder strokes rather than a million tiny steps, is more likely to learn all kinds of abstract concepts. Perhaps the area where this is likely to be most important is reinforcement learning, where allowing the reward to push the model around seems to not be a very good idea in general and testing and discarding complete strategies may be much better.

So what should we do? Combine multiple types of learning of course! There are already hundreds (or perhaps thousands) of researchers working on evolutionary computation, but for historical reasons the evolutionary computation community is rather dissociated from the community of researchers working on machine learning by gradient descent. Crossover between evolutionary learning and gradient descent yielded important insights three decades ago, and I think there is so much more to learn. When you think about it, our own intelligence is a combination of evolutionary learning and lifetime learning, and it makes sense to build artificial intelligence on similar principles.

I am not saying gradient descent is a dead end nor that it will necessarily be superseded. Backpropagation is obviously a tremendously useful algorithm and gradient descent a very powerful idea. I am also not saying that evolutionary algorithms are the best solution for everything - they very clearly are not (though some have suggested that they are the second best solution for everything). But I am saying that backpropagation is by necessity only part of the solution to the problem of creating learning machines, as it is fundamentally limited to performing induction, which is not how real discoveries are made.

Some more reading: Kenneth Stanley has though a lot about the advantages of evolution in learning, and he and his team has written some very insightful things about this. Several large AI labs have teams working on evolutionary deep learning, including Uber AI, Sentient Technologies, DeepMind, and OpenAI. Gary Marcus has recently discussed the virtues of "innateness" (learning on evolutionary timescales) in machine learning. I have worked extensively with evolutionary computation in game contexts, such as for playing games and generating content for games. Nine years ago, me and a perhaps surprising set of authors set out to briefly characterize the differences between phylogenetic (evolutionary) and ontogenetic (gradient descent-based) reinforcement learning. I don't think we got to the core of the matter back then - this blog post summarizes a lot of what I was thinking but did not know how to express properly then. Thanks to several dead philosophers for helping me express my thoughts better. There's clearly more serious thinking to be done about this problem.

I'm thinking about turning this blog post into a proper paper at some point, so feedback of all kinds is welcome.

6 comments:

onnlucky said...

Notice how blacks swans is a labeling/categorization problem. Whereas F=ma is a causal relationship. The first has no induction problem, there is no swan-ness except in our minds. The second has no induction problem, because you model causal relationships and predict from them, better predictions means less wrong. The problem of induction is only one of "ultimate reality".

Petrônio Cândido de Lima e Silva said...

Outstanding post! It is never late to remind that gradient based methods are applied on adjusting weights for a predefined network architecture while population based methods can evolve both architecture and weights.

Johnicholas said...

I wonder where Boltzmann machines fit in your philosophy. I am not a machine learning expert, but my impression from watching some of Geoff Hinton's videos is that he believes Boltzmann machines are the more general, more powerful learning algorithm that (deep) gradient descent is a fast approximation to.

Daniel Lemire said...

This line of thought seems closely related to the so-called Baldwin effect. Peter Turney has written interesting papers on the topic: https://arxiv.org/pdf/cs/0212036.pdf

Peter Turney said...

More about the Baldwin effect:

http://www.apperceptual.com/evolutionary-computation

Moshe Sipper said...

Nice piece. The philosophical ties are interesting.
Are these dining philosophers? ;-)
Some related stuff:
- Koza billed EC as an invention machine in Jaws 3 or 4.
- There's our examination of HUMIE winners, which spoke of invention and design as great EC application areas.
K. Kannappan, L. Spector, M. Sipper, T. Helmuth, W. Lacava, J. Wisdom, and O. Bernstein, Analyzing a Decade of Human-Competitive (“HUMIE”) Winners: What Can We Learn?, in Genetic Programming Theory and Practice XII (GPTP 2014), R. Riolo, B. Worzel, and M. Kotanchek, Eds. 2014, Springer.
- My KYS OFF challenge from ~15 years ago ... http://www.moshesipper.com/blog/whos-the-boss

Cool.