The key claim in the article, that gradient descent could not discover the equations of physics, seems like a statement about neural networks, not gradient descent. Given sufficient training data, a neural network can probably learn to model physics. I sympathize with the concern that it's very difficult to translate a neural network's knowledge into human concepts, but I see no reason to believe that optimizing the same system with an evolutionary algorithm would make this problem any easier. You could, e.g., try to do program induction (which was supposed to be the future of AI many decades ago) instead of modeling the data directly, but choosing to perform program induction does not preclude the use of a neural network. Neural networks trained by gradient descent can generate ASTs (e.g. http://nlp.cs.berkeley.edu/pubs/Rabinovich-Stern-Klein_2017_...).
[Edited to remove reference to universal approximation; as comments point out, even if a neural network can approximate a function, it isn't guaranteed to be able to learn it. But I am reasonably confident that a neural network can learn Newton's second law.]
The ability of a system of linked functions to approximate any continuous function seems rather far from the ability to "learn modern physics".
It would seem like knowing modern physics would involve symbolic calculations rather than just approximating the behavior of any system.
I want to see a neural network that correctly solves 3-SAT.
It just says there are weights to approximate any function, not that you can actually learn the weights. Neural networks trivially can't learn how to approximate noncomputable functions to any accuracy, and there might be a lot of other functions that neural networks are terrible at actually learning.
Learning from examples and generalizing is a much different problem from function approximation.
Now, your point about to what extent this is really about neural networks is a good one. Could a network learn F=ma, even if we could not interpret it? Maybe. With the right data, represented the right way.
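As a toy sketch of what "the right data, represented the right way" might mean (a deliberately easy, hypothetical setup, not a claim about real physics discovery): feed (m, a, F) triples to a one-weight model F_hat = w * m * a and let plain gradient descent find w. The law "is learned" when w converges to 1.

```python
import numpy as np

# Hypothetical toy demo: can gradient descent recover F = m * a from
# (m, a, F) triples? The "network" here is a single weight w in the
# model F_hat = w * m * a; the law holds when w converges to 1.
rng = np.random.default_rng(0)
m = rng.uniform(0.5, 5.0, size=200)   # masses
a = rng.uniform(-3.0, 3.0, size=200)  # accelerations
F = m * a                             # noiseless "measurements"

w = 0.0                               # initial guess
lr = 0.01
for _ in range(500):
    F_hat = w * m * a
    grad = np.mean(2 * (F_hat - F) * m * a)  # d/dw of mean squared error
    w -= lr * grad

print(round(w, 4))  # → 1.0
```

Of course, the hard part the article worries about is exactly what this sketch assumes away: someone already chose m and a as the relevant quantities and the product as the candidate form.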
No, it is not, and it may even be counterproductive, so to speak.
https://arxiv.org/pdf/1605.02026.pdf - page 8, figure 2(b). SGD-optimized neural networks stop learning at the accuracy at which whole-dataset methods start!
Also, please note that the figure I pointed to is about high-energy particle analysis. An SGD-trained NN cannot even distinguish particles with good precision, let alone discover physics.
Gradient descent and evolutionary algorithms (and many other search algorithms) advance through the hypothesis space in incremental (stochastic) steps, and both are path-dependent. How they generate and update their hypotheses, how big their steps are, how they represent their state, and how they apply randomness each create a distinct learning bias, but there is nothing fundamentally different between them.
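To make the parallel concrete, here is a minimal sketch (all details hypothetical) of both update rules minimizing the same one-dimensional objective. Only the step rule differs; both walk incrementally through the same hypothesis space.

```python
import numpy as np

# Both methods take small, path-dependent steps; only the update differs.
f = lambda x: (x - 3.0) ** 2

# Gradient descent: step along the analytic gradient.
x_gd = 0.0
for _ in range(100):
    x_gd -= 0.1 * 2 * (x_gd - 3.0)

# (1+1) evolution strategy: mutate, keep the mutant only if it is fitter.
rng = np.random.default_rng(0)
x_ea = 0.0
for _ in range(2000):
    cand = x_ea + rng.normal(0, 0.1)
    if f(cand) < f(x_ea):
        x_ea = cand

print(round(x_gd, 3), round(x_ea, 2))  # both end up near 3.0
```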
Maybe basic Newtonian physics, but I seriously doubt any ANN we've built to date could come up with QM or Relativity no matter how wonderfully massive and accurate the data was.
Looks to me like those required sophisticated conceptual understanding of the world in addition to leaps of the imagination and creative thought experiments.
But the argument here about why gradient descent won't be able to learn certain things is weak. Thought experiments are not a reliable guide to what GD can or can't do.
It's fair enough to say that F=ma and E=mc² aren't in the data. Indeed, it took thousands of years of human thought to arrive at them. So the argument "it's not clear how an algorithm could extract F=ma from the data" isn't a strong criticism, because humans also can't do it by induction.
The long process culminating in F=ma involved a lot of abstract symbolic thought. Whether human-level abstract symbolic thought can be learned through GD (probably in combination with some sort of Monte Carlo tree search) is an open question. It can only be answered by trying to build things and seeing if they work.
If you want to make an argument about the limits of GD and induction, it'd be better to compare to a problem humans can solve reliably, rather than an insight that one genius had after decades of thought while standing on the shoulders of other geniuses.
I think the distinction here should be between dataset based learning and simulator based learning. The genetic algorithms mentioned in the article rely on a dynamic environment, not a static dataset. Given the dynamic environment (which is like an infinite dataset) gradient methods can learn just as well - look at AlphaGo for example. But when the model can't experiment / try new actions and see the effects, it can't separate causes from correlations.
You can extract only so much from a dataset, the model needs a way to cause and observe external effects. The environment could be the real world, a simulated world, a game, a meta neural net optimiser (AutoML), or any domain where the model can act and influence the path of learning and the environment by its previous actions.
I'm happy to see the boom in RL and simulator based learning in the last few years. It means we are on the right track.
There is ongoing work to try to see if we can learn NN-based controllers that do their own learning or thinking. The Neural Turing Machine work being the most famous, but there's also some amount of single-shot/few-shot learning literature that is exploring other ways of learning because trying to learn from examples via SGD is too slow.
EAs are certainly interesting, and I think they show some promising results for hyperparameter tuning/architecture search of neural nets, where we don't have gradients, doing it with far less computation than RL solutions.
  A   B   C
  1   4   3
 20  35  15
  8  15   7

And so on, for some arbitrary number of rows. You can look at the table all you want, but you will not perceive "A+C=B". It's just not written there. To get A+C=B you have to generate something else in addition to the table, namely a hypothesis, and that is a creative act, not an empirical one.

How would you have measured the speed of light in the forties? Even with lots of test cases, you wouldn't be able to deduce something like that from test cases alone.
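To make that concrete, here is a toy "hypothesis generation" step, sketched as a brute-force search over a small, hand-invented space of candidate formulas (rows taken with B = 35 in the middle row, so that A + C = B actually holds throughout). The point stands: the candidate formulas themselves are the creative act; they are nowhere in the table.

```python
# Hypothetical brute-force hypothesis search over invented candidates.
rows = [(1, 4, 3), (20, 35, 15), (8, 15, 7)]  # columns A, B, C

candidates = {
    "A + C": lambda a, c: a + c,
    "A - C": lambda a, c: a - c,
    "A * C": lambda a, c: a * c,
    "2 * A": lambda a, c: 2 * a,
    "C + 1": lambda a, c: c + 1,
}

matches = [name for name, f in candidates.items()
           if all(f(a, c) == b for a, b, c in rows)]
print(matches)  # → ['A + C']
```

The table only ever lets you check a hypothesis; it never hands you one.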
With genetic algorithms (just one kind of evolutionary algorithm), the fitness, mutation, and crossover functions seem to require the implementor to look at the problem domain and "come up with something," whereas once you have a goal, gradient descent requires lots of tuning but is more or less fully defined: you can track how well you're doing, and so forth.
Perhaps there's something I'm missing. Pointers would be welcome.
Looked at:
https://en.wikipedia.org/wiki/Evolutionary_algorithm
https://en.wikipedia.org/wiki/Genetic_algorithm
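For what it's worth, the three operators can be written generically; only the fitness function touches the domain. A minimal, hypothetical sketch fitting a line y = 2x + 1 with a two-gene genome:

```python
import numpy as np

# Minimal genetic algorithm with generic selection/crossover/mutation;
# only fitness() knows anything about the problem (fitting y = 2x + 1).
rng = np.random.default_rng(0)
xs = np.linspace(-1, 1, 50)
ys = 2 * xs + 1

def fitness(genome):                   # genome = (slope, intercept)
    pred = genome[0] * xs + genome[1]
    return -np.mean((pred - ys) ** 2)  # higher is better

pop = rng.normal(0, 1, size=(30, 2))
for _ in range(200):
    scores = np.array([fitness(g) for g in pop])
    parents = pop[np.argsort(scores)[-10:]]          # selection: top third
    kids = []
    for _ in range(len(pop)):
        pa, pb = parents[rng.integers(0, 10, size=2)]
        mask = rng.random(2) < 0.5                   # uniform crossover
        child = np.where(mask, pa, pb)
        child = child + rng.normal(0, 0.05, size=2)  # mutation
        kids.append(child)
    pop = np.array(kids)

best = max(pop, key=fitness)
print(np.round(best, 1))  # close to [2., 1.]
```

The representation (real vector, bitstring, expression tree) is where the domain insight really leaks in, much like architecture choice does for gradient descent.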
So you can make it work with generic models. Although, as you say, many of the big successes, like the Backgammon players and Koza's work on electronic circuits, used domain-specific models.
The example the author cites regarding evo algorithms learning physical laws is laughable - "It's just not in the data - it has to be invented" applies equally to both the backprop and the evolutionary learning algorithms.
"In this case, the representation (mathematical expressions represented as trees) is distinctly non-differentiable, so could not even in principle be learned through gradient descent."
This is incorrect, almost like saying NLP data is not differentiable. For instance, set this representation up as the output of a network (or, if you wanted to be fancier, the central component of an autoencoder), and see how well it predicts/correlates with the experimental data. This is the error, which is back-propagated through the network's nodes.
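For instance (a minimal, hypothetical sketch in plain numpy rather than a real deep learning framework): serialize the expression tree to a prefix token sequence, have the "network" output one row of logits per token slot, and backpropagate softmax cross-entropy. The tree is discrete, but the loss over the token choices is perfectly differentiable.

```python
import numpy as np

# The expression tree "* m a" (i.e. m*a) in prefix notation, as tokens.
# Vocabulary and target formula are hypothetical.
vocab = ["+", "*", "m", "a", "F"]
target = [vocab.index(t) for t in ["*", "m", "a"]]

rng = np.random.default_rng(0)
logits = rng.normal(0, 0.1, size=(3, len(vocab)))  # one row per token slot

def loss_and_grad(logits):
    # Softmax cross-entropy per slot; its gradient is (probs - onehot).
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(probs[np.arange(3), target]))
    grad = probs.copy()
    grad[np.arange(3), target] -= 1.0
    return loss, grad / 3

for _ in range(200):
    loss, grad = loss_and_grad(logits)
    logits -= 1.0 * grad                # plain gradient descent

decoded = [vocab[i] for i in logits.argmax(axis=1)]
print(decoded)  # → ['*', 'm', 'a']
```

In a real system the logits would come from a network conditioned on the data, but the differentiability argument is the same.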
FWIW, many theoreticians believe that the unreasonable effectiveness of neural networks and especially transfer learning is a result of their well-suitedness to encode laws of physics and Euclidean geometry. The author's final points about a nine-year-old survey may be out of date w.r.t. contemporary neural networks, which often have spookily good local minima and do not behave in the way intuition about gradient descent might suggest.
agreed, especially with policy gradients.
> If the dimensionality is small, second-order methods (or approximations thereof) can do dramatically better yet.
I have not seen second-order methods used in practice, presumably due to memory limitations. Can you point me to examples?
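To illustrate the quoted claim on a tiny quadratic (a toy example, not production code): a full Newton step uses the Hessian and lands on the minimum in one step, while gradient descent creeps along the ill-conditioned direction.

```python
import numpy as np

# Quadratic objective 0.5 x'Hx - b'x with condition number 100.
H = np.array([[100.0, 0.0], [0.0, 1.0]])   # Hessian
b = np.array([1.0, 1.0])
grad = lambda x: H @ x - b                 # gradient of the objective
x_star = np.linalg.solve(H, b)             # true minimum

# One Newton step from the origin: x - H^{-1} grad(x).
x_newton = np.zeros(2) - np.linalg.solve(H, grad(np.zeros(2)))

# Gradient descent: the step size is capped by the largest curvature.
x_gd = np.zeros(2)
for _ in range(100):
    x_gd -= 0.01 * grad(x_gd)

print(np.allclose(x_newton, x_star))       # → True
print(np.linalg.norm(x_gd - x_star))       # still far in the flat direction
```

As for practice: L-BFGS (available in scipy.optimize and torch.optim) is the usual Hessian-free approximation; the memory problem is exactly why full second-order methods are rare in deep learning.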
It's not about the network architecture, or gradient descent on their own - it's the interaction, the dynamical system over weight space that training is.
Behind every great modern deep learning result? An enormous hyperparameter search and lots of elbow grease to carefully tune that dynamical system juuuuust right so the weight particle ends up in just the right place when training finishes. Smells like evolution to me. Deepmind even formalized the evolutionary process a deep learning researcher runs manually when fine-tuning a model into population based training https://deepmind.com/blog/population-based-training-neural-n...
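A toy sketch of that formalization (all numbers hypothetical): a population of workers trains normally, and periodically the worst worker copies the best worker's weights (exploit) and perturbs its hyperparameters (explore).

```python
import numpy as np

# Population-based-training idea in miniature: each worker has its own
# learning rate; the worst periodically exploits (copies best weights)
# and explores (perturbs its learning rate).
rng = np.random.default_rng(0)
loss = lambda w: (w - 5.0) ** 2
step = lambda w, lr: w - lr * 2 * (w - 5.0)   # one SGD step on the loss

workers = [{"w": 0.0, "lr": lr} for lr in [1e-4, 1e-3, 1e-2, 1e-1]]
for epoch in range(20):
    for wk in workers:
        for _ in range(10):                   # a burst of ordinary training
            wk["w"] = step(wk["w"], wk["lr"])
    workers.sort(key=lambda wk: loss(wk["w"]))
    best, worst = workers[0], workers[-1]
    worst["w"] = best["w"]                            # exploit
    worst["lr"] = best["lr"] * rng.choice([0.8, 1.2])  # explore

print(round(workers[0]["w"], 2))  # → 5.0
```

The hyperparameter schedule "evolves" alongside the weights, which is exactly the manual loop a researcher otherwise runs by hand.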