The major contribution of the work is showing that a ResNet needs only a number of parameters polynomial in the dataset size to converge to a global optimum, in contrast to traditional (non-residual) neural nets, which require an exponential number of parameters.
> The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet).
What's the difference? Any point where the loss is zero is a global minimum.
1. It's theoretically impossible to guarantee convergence to a global optimum using gradient descent if the function is non-convex.
2. The only way to guarantee it is to restart gradient descent from different points in the search space, or to try different step sizes if the algorithm always starts from the same point.
3. Also, does "achieving zero training loss" mean the network has converged to the global optimum? I thought you could get zero training loss at a local minimum as well.
Please correct me if I am wrong.
1. This is false. See, e.g., [0][1].
2. I'm not really sure what the question is here.
3. If your loss is bounded from below by 0 (it is a squared norm) and you achieve 0 loss, then 0 is a global optimum, since, by definition, no other objective value can be smaller than this number.
---
[0] Theorem A.2 in Udell's Generalized Low-Rank models paper https://arxiv.org/pdf/1410.0342.pdf
[1] B&V Convex Optimization (https://web.stanford.edu/~boyd/cvxbook/), Appendix B.1. In fact, I can't find the reference right now, but you can easily prove that GD with an appropriate step-size converges to a global optimum on this problem when initialized at (0,0), even though the problem is non-convex.
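A toy sketch of this point (not the cited proof, and the dimensions, step size, and iteration count here are arbitrary choices): plain gradient descent on a low-rank matrix-factorization objective, which is non-convex in (X, Y) jointly, typically reaches a global optimum (zero loss) from a small random initialization.

```python
import numpy as np

# Non-convex objective: 0.5 * ||X Y^T - A||_F^2 over both factors.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 5))  # rank-2 target

X = 0.1 * rng.standard_normal((5, 2))  # small random init
Y = 0.1 * rng.standard_normal((5, 2))
lr = 0.02
for _ in range(50000):
    R = X @ Y.T - A             # residual
    gX, gY = R @ Y, R.T @ X     # gradients w.r.t. X and Y
    X -= lr * gX
    Y -= lr * gY

final_loss = float(((X @ Y.T - A) ** 2).sum())
print(final_loss)  # driven very close to zero despite non-convexity
```

This is only an empirical illustration of the phenomenon; the references above give actual guarantees for problems of this type.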
I think the OP intended and should have written:
"It's theoretically impossible to guarantee a convergence to global optima using gradient descent for an arbitrary non-convex function."
For example, consider the function f(x) = sin^2(pi * x) + sin^2(pi * N/x). This function has multiple global minima at the divisors of N, where f(x) == 0; if x or N/x is non-integer, it is guaranteed to be positive...
I am not taking a stance on whether gradient descent does or does not guarantee finding global minima and is thus able to factor cryptographic-grade RSA products of primes, but the claim does appear to imply it.
Edit: the multiplication (asterisk) symbols were rendering some of the text in italics.
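A quick check of the example above (N = 15 is an arbitrary choice): f is essentially zero at a divisor of N and strictly positive elsewhere.

```python
import math

# f(x) = sin^2(pi x) + sin^2(pi N / x) is zero exactly when both x and
# N/x are integers, i.e. when x is a divisor of N; positive otherwise.
def f(x, N):
    return math.sin(math.pi * x) ** 2 + math.sin(math.pi * N / x) ** 2

N = 15
div = f(3.0, N)      # 3 divides 15: essentially zero (up to float error)
nondiv = f(2.0, N)   # 2 does not divide 15: strictly positive
print(div, nondiv)
```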
2) This is "a way" not "The only way".
(If A then B) does not imply (if not A then not B)
I don't understand. How do you prove that gradient descent is guaranteed to escape local minima?
Re 2: No.
Re 3: Yes.
Anyway, this universal-approximation property of neural networks gets a lot of airtime and people go gaga over it. It's a complete red herring. It's not the first example of universal approximation and it won't be the last. There is no scarcity of universal approximators; there was no such scarcity even hundreds of years ago. The explanation of the success of DNNs lies elsewhere.
https://calculatedcontent.com/2018/09/21/rank-collapse-in-de...
But they miss something: the weight matrices also display power-law behavior.
https://calculatedcontent.com/2018/09/09/power-laws-in-deep-...
This is also important because it was suggested in the early 90s that heavy-tailed spin glasses would have a single local minimum.
This fact is the basis of my earlier suggestion that DNNs would exhibit a spin funnel.
In summary, the difference between the MSR paper and this paper is: if H denotes the number of layers and m the number of hidden nodes, the MSR paper shows that assuming only m > poly(H), SGD can find the global optimum. Du et al. have a similar result, but they have to assume m > 2^{O(H)}. Compared to the MSR paper, Du et al.'s paper is actually pretty trivial.
> The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet).
If this variant of gradient descent is able to reach global minima in polynomial time, and if neural networks are proven to approximate any function, then ostensibly this technique could be used to guarantee the lowest error possible in approximating any function. This seems incredibly important. Can someone correct my reading of the abstract?
But this says nothing about generalization, i.e. performance on the test set, which is what we really want.
1. Zero training loss is impossible in most networks because the last layer can only reach the targets asymptotically.
2. Zero training loss means nothing from a practical standpoint. We've had algorithms capable of it for a long time (k-NN with k = 1, decision trees, etc.).
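To make the k-NN point concrete, here is a minimal from-scratch sketch (arbitrary random data): 1-nearest-neighbour trivially reaches zero training error, because every training point is its own nearest neighbour.

```python
import numpy as np

# 1-NN prediction: each query takes the label of its closest training point.
def one_nn_predict(X_train, y_train, X):
    d = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[d.argmin(axis=1)]

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.integers(0, 2, size=20)        # even random labels are fit perfectly
train_acc = float((one_nn_predict(X, y, X) == y).mean())
print(train_acc)  # 1.0 on the training set, regardless of the labels
```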
2. You clearly have no idea what you are talking about. This paper is trying to argue a bit about why neural networks generalize well by showing, with math, that a NN satisfying its conditions converges to zero training loss. It isn't remotely meant to be practical. IT IS A THEORETICAL PAPER.
And comparing it to nearest neighbors with k = 1 is so, so, so silly it isn't even wrong.
Edit: #1 is actually an entire research direction in the theory of machine learning, FYI.
It is possible to get neural networks that massively overfit but still generalize (which is weird).
https://arxiv.org/pdf/1611.03530.pdf
That paper was really famous. It showed you can get zero training loss on data when you replace the labels with random noise.
edit 2: I am sorry to be harsh. It is just hard to read such arrant nonsense.
You build the circuit corresponding to the function and map it to an NN. Can this be discovered easily via GD? Absolutely no clue (though this paper says "yes"), but is it possible to approximate it? Yes, you can nail it exactly in a polynomial number of layers (if the algorithm takes poly-space, which is a necessary condition for it to run in poly-time).
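A hedged sketch of that "circuit to NN" mapping (an illustration, not the paper's construction): each boolean gate can be implemented by a single ReLU unit on {0, 1} inputs, so circuit depth translates directly into network depth.

```python
# One ReLU unit per gate; inputs and outputs live in {0, 1}.
relu = lambda z: max(z, 0.0)

AND = lambda a, b: relu(a + b - 1)        # 1 iff both inputs are 1
OR  = lambda a, b: 1 - relu(1 - a - b)    # 1 iff at least one input is 1
NOT = lambda a: 1 - a                     # affine, needs no nonlinearity

# XOR as a two-level circuit: (a OR b) AND NOT(a AND b)
XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))
print([XOR(a, b) for a in (0, 1) for b in (0, 1)])  # truth table 0, 1, 1, 0
```

Composing such gates layer by layer gives the polynomial-depth exact representation the comment refers to.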
Can someone expand on this? I've never heard of this before, at least not in the general case.
They shuffled the labels on their datasets, so there can’t possibly be anything to learn, yet got zero training loss, meaning the network must be severely overfitting. Yet the same network trained with the actual labels shows quite good generalization. So the usual intuition about overfitting and the bias-variance tradeoff doesn’t seem to apply.
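The same phenomenon in its simplest form (a sketch with arbitrary sizes, not the paper's setup): an over-parameterized linear model, with more features than samples, fits completely random labels to zero training loss under plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                                 # more features than samples
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n).astype(float)   # "shuffled"/random labels

w = np.zeros(d)
lr = 0.01
for _ in range(2000):
    w -= lr * X.T @ (X @ w - y) / n            # gradient of 0.5 * mean sq. error
final_loss = float(((X @ w - y) ** 2).mean())
print(final_loss)  # ~0: the noise is memorized, though there was nothing to learn
```

Whether the resulting model generalizes is a separate question, which is exactly the point of the randomization test.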
“Understanding Deep Learning Requires REMEMBERING Generalization”
https://calculatedcontent.com/2018/04/01/rethinking-or-remem...
https://arxiv.org/abs/1710.09553
We can understand this using the traditional theory of the statistical mechanics of generalization.
Briefly, shuffling the labels corresponds to decreasing the effective load on the neural network, which pushes the system into the spin-glass phase.
When you have nothing to learn, you need to memorize the data. But when there is structure, it is easier to memorize the structure, so the network will learn this first (and will memorize after).