The key claim in the article, that gradient descent could not discover the equations of physics, seems like a statement about neural networks, not gradient descent. Given sufficient training data, a neural network can probably learn to model physics. I sympathize with the concern that it's very difficult to translate a neural network's knowledge into human concepts, but I see no reason to believe that optimizing the same system with an evolutionary algorithm would make this problem any easier. You could, e.g., try to do program induction (which was supposed to be the future of AI many decades ago) instead of modeling the data directly, but choosing to perform program induction does not preclude the use of a neural network. Neural networks trained by gradient descent can generate ASTs (e.g. http://nlp.cs.berkeley.edu/pubs/Rabinovich-Stern-Klein_2017_...).
[Edited to remove reference to universal approximation; as comments point out, even if a neural network can approximate a function, it isn't guaranteed to be able to learn it. But I am reasonably confident that a neural network can learn Newton's second law.]
The ability of a system of linked functions to approximate any continuous function seems rather far from the ability to "learn modern physics".
It would seem like knowing modern physics would involve symbolic calculations rather than just approximating the behavior of any system.
I want to see a neural network that correctly solves 3-SAT.
It just says there are weights to approximate any function, not that you can actually learn the weights. Neural networks trivially can't learn how to approximate noncomputable functions to any accuracy, and there might be a lot of other functions that neural networks are terrible at actually learning.
Learning from examples and generalizing is a much different problem from function approximation.
Now, your point about to what extent this is really about neural networks is a good one. Could a network learn F=ma, even if we could not interpret it? Maybe. With the right data, represented the right way.
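As a toy sketch of what "the right data, represented the right way" might mean (a deliberately easy, hypothetical setup, not a claim about real physics discovery): feed (m, a, F) triples to a one-weight model F_hat = w * m * a and let plain gradient descent find w. The law "is learned" when w converges to 1.

```python
import numpy as np

# Hypothetical toy demo: can gradient descent recover F = m * a from
# (m, a, F) triples? The "network" here is a single weight w in the
# model F_hat = w * m * a; the law holds when w converges to 1.
rng = np.random.default_rng(0)
m = rng.uniform(0.5, 5.0, size=200)   # masses
a = rng.uniform(-3.0, 3.0, size=200)  # accelerations
F = m * a                             # noiseless "measurements"

w = 0.0                               # initial guess
lr = 0.01
for _ in range(500):
    F_hat = w * m * a
    grad = np.mean(2 * (F_hat - F) * m * a)  # d/dw of mean squared error
    w -= lr * grad

print(round(w, 4))  # → 1.0
```

Of course, the hard part the article worries about is exactly what this sketch assumes away: someone already chose m and a as the relevant quantities and the product as the candidate form.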
No, it is not, and it may even be counterproductive, so to speak.
https://arxiv.org/pdf/1605.02026.pdf - page 8, figure 2(b). SGD-optimized neural networks stop learning at the accuracy at which whole-dataset methods start!
Also, please note that the figure I pointed to is about high-energy particle analysis. An SGD-trained NN cannot even distinguish particles with good precision, let alone discover physics.
Gradient descent and evolutionary algorithms (and many other search algorithms) advance through the hypothesis space in incremental (stochastic) steps, and both are path-dependent. How they generate and update their hypotheses, how big their steps are, how they represent their state, and how they apply randomness each create a distinct learning bias, but there is nothing fundamentally different between them.
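To make the parallel concrete, here is a minimal sketch (all details hypothetical) of both update rules minimizing the same one-dimensional objective. Only the step rule differs; both walk incrementally through the same hypothesis space.

```python
import numpy as np

# Both methods take small, path-dependent steps; only the update differs.
f = lambda x: (x - 3.0) ** 2

# Gradient descent: step along the analytic gradient.
x_gd = 0.0
for _ in range(100):
    x_gd -= 0.1 * 2 * (x_gd - 3.0)

# (1+1) evolution strategy: mutate, keep the mutant only if it is fitter.
rng = np.random.default_rng(0)
x_ea = 0.0
for _ in range(2000):
    cand = x_ea + rng.normal(0, 0.1)
    if f(cand) < f(x_ea):
        x_ea = cand

print(round(x_gd, 3), round(x_ea, 2))  # both end up near 3.0
```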
Maybe basic Newtonian physics, but I seriously doubt any ANN we've built to date could come up with QM or Relativity no matter how wonderfully massive and accurate the data was.
Looks to me like those required sophisticated conceptual understanding of the world in addition to leaps of the imagination and creative thought experiments.
But the argument here about why gradient descent won't be able to learn certain things is weak. Thought experiments are not a reliable guide to what GD can or can't do.
It's fair enough to say that F=ma and E=mc² aren't in the data. Indeed, it took thousands of years of human thought to arrive at them. So the argument "it's not clear how an algorithm could extract F=ma from the data" isn't a strong criticism, because humans also can't do it by induction.
The long process culminating in F=ma involved a lot of abstract symbolic thought. Whether human-level abstract symbolic thought can be learned through GD (probably in combination with some sort of Monte Carlo tree search) is an open question. It can only be answered by trying to build things and seeing if they work.
If you want to make an argument about the limits of GD and induction, it'd be better to compare to a problem humans can solve reliably, rather than an insight that one genius had after decades of thought while standing on the shoulders of other geniuses.
I think the distinction here should be between dataset based learning and simulator based learning. The genetic algorithms mentioned in the article rely on a dynamic environment, not a static dataset. Given the dynamic environment (which is like an infinite dataset) gradient methods can learn just as well - look at AlphaGo for example. But when the model can't experiment / try new actions and see the effects, it can't separate causes from correlations.
You can extract only so much from a dataset, the model needs a way to cause and observe external effects. The environment could be the real world, a simulated world, a game, a meta neural net optimiser (AutoML), or any domain where the model can act and influence the path of learning and the environment by its previous actions.
I'm happy to see the boom in RL and simulator based learning in the last few years. It means we are on the right track.
There is ongoing work to try to see if we can learn NN-based controllers that do their own learning or thinking. The Neural Turing Machine work being the most famous, but there's also some amount of single-shot/few-shot learning literature that is exploring other ways of learning because trying to learn from examples via SGD is too slow.
EAs are certainly interesting, and I think they show some promising results for hyperparameter tuning/architecture search of neural nets, where we don't have gradients, doing it with far less computation than RL solutions.
  A   B   C
  1   4   3
 20  35  15
  8  15   7

And so on, for some arbitrary number of rows. You can look at the table all you want, but you will not perceive "A+C=B". It's just not written there. To get A+C=B you have to generate something else in addition to the table, namely a hypothesis, and that is a creative act, not an empirical one.

How would you have measured the speed of light in the forties? Even with lots of test cases, you wouldn't be able to deduce something like that from test cases alone.
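To make that concrete, here is a toy "hypothesis generation" step, sketched as a brute-force search over a small, hand-invented space of candidate formulas (rows taken with B = 35 in the middle row, so that A + C = B actually holds throughout). The point stands: the candidate formulas themselves are the creative act; they are nowhere in the table.

```python
# Hypothetical brute-force hypothesis search over invented candidates.
rows = [(1, 4, 3), (20, 35, 15), (8, 15, 7)]  # columns A, B, C

candidates = {
    "A + C": lambda a, c: a + c,
    "A - C": lambda a, c: a - c,
    "A * C": lambda a, c: a * c,
    "2 * A": lambda a, c: 2 * a,
    "C + 1": lambda a, c: c + 1,
}

matches = [name for name, f in candidates.items()
           if all(f(a, c) == b for a, b, c in rows)]
print(matches)  # → ['A + C']
```

The table only ever lets you check a hypothesis; it never hands you one.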
With genetic algorithms (just one kind of evolutionary algorithm), the fitness, mutation, and crossover functions seem to require the implementor to look at the problem domain and "come up with something," whereas once you have a goal, gradient descent requires lots of tuning but is more or less fully defined: you can track how well you're doing, and so forth.
Perhaps there's something I'm missing. Pointers would be welcome.
Looked at:
https://en.wikipedia.org/wiki/Evolutionary_algorithm
https://en.wikipedia.org/wiki/Genetic_algorithm
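For what it's worth, the three operators can be written generically; only the fitness function touches the domain. A minimal, hypothetical sketch fitting a line y = 2x + 1 with a two-gene genome:

```python
import numpy as np

# Minimal genetic algorithm with generic selection/crossover/mutation;
# only fitness() knows anything about the problem (fitting y = 2x + 1).
rng = np.random.default_rng(0)
xs = np.linspace(-1, 1, 50)
ys = 2 * xs + 1

def fitness(genome):                   # genome = (slope, intercept)
    pred = genome[0] * xs + genome[1]
    return -np.mean((pred - ys) ** 2)  # higher is better

pop = rng.normal(0, 1, size=(30, 2))
for _ in range(200):
    scores = np.array([fitness(g) for g in pop])
    parents = pop[np.argsort(scores)[-10:]]          # selection: top third
    kids = []
    for _ in range(len(pop)):
        pa, pb = parents[rng.integers(0, 10, size=2)]
        mask = rng.random(2) < 0.5                   # uniform crossover
        child = np.where(mask, pa, pb)
        child = child + rng.normal(0, 0.05, size=2)  # mutation
        kids.append(child)
    pop = np.array(kids)

best = max(pop, key=fitness)
print(np.round(best, 1))  # close to [2., 1.]
```

The representation (real vector, bitstring, expression tree) is where the domain insight really leaks in, much like architecture choice does for gradient descent.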
So you can make it work with generic models. Although, as you say, many of the big successes, like the Backgammon players and Koza's work on electronic circuits, used domain-specific models.
The example the author cites regarding evo algorithms learning physical laws is laughable - "It's just not in the data - it has to be invented" applies equally to both the backprop and the evolutionary learning algorithms.
"In this case, the representation (mathematical expressions represented as trees) is distinctly non-differentiable, so could not even in principle be learned through gradient descent."
This is incorrect, almost like saying NLP data is not differentiable. For instance, set this representation up as the output of a network (or, if you wanted to be fancier, the central component of an autoencoder), and see how well it predicts/correlates with the experimental data. This is the error, which is back-propagated through the network's nodes.
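For instance (a minimal, hypothetical sketch in plain numpy rather than a real deep learning framework): serialize the expression tree to a prefix token sequence, have the "network" output one row of logits per token slot, and backpropagate softmax cross-entropy. The tree is discrete, but the loss over the token choices is perfectly differentiable.

```python
import numpy as np

# The expression tree "* m a" (i.e. m*a) in prefix notation, as tokens.
# Vocabulary and target formula are hypothetical.
vocab = ["+", "*", "m", "a", "F"]
target = [vocab.index(t) for t in ["*", "m", "a"]]

rng = np.random.default_rng(0)
logits = rng.normal(0, 0.1, size=(3, len(vocab)))  # one row per token slot

def loss_and_grad(logits):
    # Softmax cross-entropy per slot; its gradient is (probs - onehot).
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(probs[np.arange(3), target]))
    grad = probs.copy()
    grad[np.arange(3), target] -= 1.0
    return loss, grad / 3

for _ in range(200):
    loss, grad = loss_and_grad(logits)
    logits -= 1.0 * grad                # plain gradient descent

decoded = [vocab[i] for i in logits.argmax(axis=1)]
print(decoded)  # → ['*', 'm', 'a']
```

In a real system the logits would come from a network conditioned on the data, but the differentiability argument is the same.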
FWIW, many theoreticians believe that the unreasonable effectiveness of neural networks and especially transfer learning is a result of their well-suitedness to encode laws of physics and Euclidean geometry. The author's final points about a nine-year-old survey may be out of date w.r.t. contemporary neural networks, which often have spookily good local minima and do not behave in the way intuition about gradient descent might suggest.
agreed, especially with policy gradients.
> If the dimensionality is small, second-order methods (or approximations thereof) can do dramatically better yet.
I have not seen second-order methods used in practice, presumably due to memory limitations. Can you point me to examples?
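To illustrate the quoted claim on a tiny quadratic (a toy example, not production code): a full Newton step uses the Hessian and lands on the minimum in one step, while gradient descent creeps along the ill-conditioned direction.

```python
import numpy as np

# Quadratic objective 0.5 x'Hx - b'x with condition number 100.
H = np.array([[100.0, 0.0], [0.0, 1.0]])   # Hessian
b = np.array([1.0, 1.0])
grad = lambda x: H @ x - b                 # gradient of the objective
x_star = np.linalg.solve(H, b)             # true minimum

# One Newton step from the origin: x - H^{-1} grad(x).
x_newton = np.zeros(2) - np.linalg.solve(H, grad(np.zeros(2)))

# Gradient descent: the step size is capped by the largest curvature.
x_gd = np.zeros(2)
for _ in range(100):
    x_gd -= 0.01 * grad(x_gd)

print(np.allclose(x_newton, x_star))       # → True
print(np.linalg.norm(x_gd - x_star))       # still far in the flat direction
```

As for practice: L-BFGS (available in scipy.optimize and torch.optim) is the usual Hessian-free approximation; the memory problem is exactly why full second-order methods are rare in deep learning.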
It's not about the network architecture, or gradient descent on their own - it's the interaction, the dynamical system over weight space that training is.
Behind every great modern deep learning result? An enormous hyperparameter search and lots of elbow grease to carefully tune that dynamical system juuuuust right so the weight particle ends up in just the right place when training finishes. Smells like evolution to me. Deepmind even formalized the evolutionary process a deep learning researcher runs manually when fine-tuning a model into population based training https://deepmind.com/blog/population-based-training-neural-n...
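A toy sketch of that formalization (all numbers hypothetical): a population of workers trains normally, and periodically the worst worker copies the best worker's weights (exploit) and perturbs its hyperparameters (explore).

```python
import numpy as np

# Population-based-training idea in miniature: each worker has its own
# learning rate; the worst periodically exploits (copies best weights)
# and explores (perturbs its learning rate).
rng = np.random.default_rng(0)
loss = lambda w: (w - 5.0) ** 2
step = lambda w, lr: w - lr * 2 * (w - 5.0)   # one SGD step on the loss

workers = [{"w": 0.0, "lr": lr} for lr in [1e-4, 1e-3, 1e-2, 1e-1]]
for epoch in range(20):
    for wk in workers:
        for _ in range(10):                   # a burst of ordinary training
            wk["w"] = step(wk["w"], wk["lr"])
    workers.sort(key=lambda wk: loss(wk["w"]))
    best, worst = workers[0], workers[-1]
    worst["w"] = best["w"]                            # exploit
    worst["lr"] = best["lr"] * rng.choice([0.8, 1.2])  # explore

print(round(workers[0]["w"], 2))  # → 5.0
```

The hyperparameter schedule "evolves" alongside the weights, which is exactly the manual loop a researcher otherwise runs by hand.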