The major contribution of the work is showing that a ResNet needs only a number of parameters polynomial in the dataset size to converge to a global optimum, in contrast to traditional (non-residual) neural nets, which require an exponential number of parameters.
> The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet).
What's the difference? Any point where the loss is zero is a global minimum.
1. It's theoretically impossible to guarantee convergence to a global optimum using gradient descent if the function is non-convex.
2. The only way to guarantee it is to restart gradient descent from different points in the search space, or to try different step sizes if the algorithm always starts from the same point.
3. Also, does "achieving zero training loss" mean the network has converged to the global optimum? I thought you could get zero training loss at a local minimum as well.
Please correct me if I am wrong.
1. This is false. See, e.g., [0][1].
2. I'm not really sure what the question is here.
3. If your loss is bounded from below by 0 (it is a squared norm) and you achieve 0 loss, then 0 is a global optimum, since, by definition, no other objective value can be smaller than this number.
---
[0] Theorem A.2 in Udell's Generalized Low-Rank models paper https://arxiv.org/pdf/1410.0342.pdf
[1] B&V Convex Optimization (https://web.stanford.edu/~boyd/cvxbook/), Appendix B.1. In fact, I can't find the reference right now, but you can easily prove that GD with an appropriate step-size converges to a global optimum on this problem when initialized at (0,0), even though the problem is non-convex.
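A toy sketch of this point (not the cited proof, and the dimensions, step size, and iteration count here are arbitrary choices): plain gradient descent on a low-rank matrix-factorization objective, which is non-convex in (X, Y) jointly, typically reaches a global optimum (zero loss) from a small random initialization.

```python
import numpy as np

# Non-convex objective: 0.5 * ||X Y^T - A||_F^2 over both factors.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 5))  # rank-2 target

X = 0.1 * rng.standard_normal((5, 2))  # small random init
Y = 0.1 * rng.standard_normal((5, 2))
lr = 0.02
for _ in range(50000):
    R = X @ Y.T - A             # residual
    gX, gY = R @ Y, R.T @ X     # gradients w.r.t. X and Y
    X -= lr * gX
    Y -= lr * gY

final_loss = float(((X @ Y.T - A) ** 2).sum())
print(final_loss)  # driven very close to zero despite non-convexity
```

This is only an empirical illustration of the phenomenon; the references above give actual guarantees for problems of this type.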
I think the OP intended and should have written:
"It's theoretically impossible to guarantee a convergence to global optima using gradient descent for an arbitrary non-convex function."
For example, consider the function f(x) = sin^2(pi * x) + sin^2(pi * N/x). This function has multiple global minima at the divisors of N, where f(x) == 0; if x or N/x is non-integer, it is guaranteed to be positive...
I am not taking a stance on whether gradient descent does or does not guarantee finding global minima and is thus able to factor cryptographic-grade RSA products of primes, but the claim does appear to imply it.
Edit: the multiplication (asterisk) symbols were rendering some of the text in italics.
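A quick check of the example above (N = 15 is an arbitrary choice): f is essentially zero at a divisor of N and strictly positive elsewhere.

```python
import math

# f(x) = sin^2(pi x) + sin^2(pi N / x) is zero exactly when both x and
# N/x are integers, i.e. when x is a divisor of N; positive otherwise.
def f(x, N):
    return math.sin(math.pi * x) ** 2 + math.sin(math.pi * N / x) ** 2

N = 15
div = f(3.0, N)      # 3 divides 15: essentially zero (up to float error)
nondiv = f(2.0, N)   # 2 does not divide 15: strictly positive
print(div, nondiv)
```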
2) This is "a way" not "The only way".
(If A then B) does not imply (if not A then not B)
I don't understand. How do you prove that gradient descent is guaranteed to escape local minima?
Re 2: No.
Re 3: Yes.
Anyway, this universal-approximation property of neural networks gets a lot of airtime and people go gaga over it. It's a complete red herring. It's not the first example of universal approximation and it won't be the last. There is no scarcity of universal approximators; there was no such scarcity even hundreds of years ago. The explanation of the success of DNNs lies elsewhere.
https://calculatedcontent.com/2018/09/21/rank-collapse-in-de...
But they miss something: the weight matrices also display power-law behavior.
https://calculatedcontent.com/2018/09/09/power-laws-in-deep-...
This is also important because it was suggested in the early 90s that heavy-tailed spin glasses would have a single local minimum.
This fact is the basis of my earlier suggestion that DNNs would exhibit a spin funnel.
In summary, the difference between the MSR paper and this paper is: if H denotes the number of layers and m the number of hidden nodes, the MSR paper shows that assuming only m > poly(H), SGD can find the global optimum. Du et al. have a similar result, but they have to assume m > 2^{O(H)}. Compared to the MSR paper, Du et al.'s paper is actually pretty trivial.
> The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet).
If this variant of gradient descent is able to reach global minima in polynomial time, and if neural networks are proven to approximate any function, then ostensibly this technique could be used to guarantee the lowest error possible in approximating any function. This seems incredibly important. Can someone correct my reading of the abstract?
But this says nothing about generalization, i.e. performance on the test set, which is what we really want.
1. Zero training loss is impossible in most networks because the last layer can only reach the targets asymptotically.
2. Zero training loss means nothing from a practical standpoint. We've had algorithms capable of it for a long time (k-NN with k = 1, decision trees, etc.).
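To make the k-NN point concrete, here is a minimal from-scratch sketch (arbitrary random data): 1-nearest-neighbour trivially reaches zero training error, because every training point is its own nearest neighbour.

```python
import numpy as np

# 1-NN prediction: each query takes the label of its closest training point.
def one_nn_predict(X_train, y_train, X):
    d = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[d.argmin(axis=1)]

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.integers(0, 2, size=20)        # even random labels are fit perfectly
train_acc = float((one_nn_predict(X, y, X) == y).mean())
print(train_acc)  # 1.0 on the training set, regardless of the labels
```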
2. You clearly have no idea what you are talking about. This paper is trying to argue a bit about why neural networks generalize well by showing, with math, that a NN satisfying its conditions converges to zero training loss. It isn't remotely meant to be practical. IT IS A THEORETICAL PAPER.
And comparing it to nearest neighbors with k = 1 is so, so, so silly it isn't even wrong.
Edit: #1 is actually an entire research direction in the theory of machine learning, FYI.
It is possible to get neural networks that massively overfit but still generalize (which is weird).
https://arxiv.org/pdf/1611.03530.pdf
That paper was really famous. It showed you can get zero training loss on data when you replace the labels with random noise.
edit 2: I am sorry to be harsh. It is just hard to read such arrant nonsense.
You build the circuit corresponding to the function and map it to an NN. Can this be discovered easily via GD? Absolutely no clue (though this paper says "yes"), but is it possible to approximate it? Yes, you can nail it exactly in a polynomial number of layers (if the algorithm takes poly-space, which is a necessary condition for it to run in poly-time).
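A hedged sketch of that "circuit to NN" mapping (an illustration, not the paper's construction): each boolean gate can be implemented by a single ReLU unit on {0, 1} inputs, so circuit depth translates directly into network depth.

```python
# One ReLU unit per gate; inputs and outputs live in {0, 1}.
relu = lambda z: max(z, 0.0)

AND = lambda a, b: relu(a + b - 1)        # 1 iff both inputs are 1
OR  = lambda a, b: 1 - relu(1 - a - b)    # 1 iff at least one input is 1
NOT = lambda a: 1 - a                     # affine, needs no nonlinearity

# XOR as a two-level circuit: (a OR b) AND NOT(a AND b)
XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))
print([XOR(a, b) for a in (0, 1) for b in (0, 1)])  # truth table 0, 1, 1, 0
```

Composing such gates layer by layer gives the polynomial-depth exact representation the comment refers to.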
Can someone expand on this? I've never heard of this before, at least not in the general case.
They shuffled the labels on their datasets, so there can’t possibly be anything to learn, yet got zero training loss, meaning the network must be severely overfitting. Yet the same network trained with the actual labels shows quite good generalization. So the usual intuition about overfitting and the bias-variance tradeoff doesn’t seem to apply.
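The same phenomenon in its simplest form (a sketch with arbitrary sizes, not the paper's setup): an over-parameterized linear model, with more features than samples, fits completely random labels to zero training loss under plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                                 # more features than samples
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n).astype(float)   # "shuffled"/random labels

w = np.zeros(d)
lr = 0.01
for _ in range(2000):
    w -= lr * X.T @ (X @ w - y) / n            # gradient of 0.5 * mean sq. error
final_loss = float(((X @ w - y) ** 2).mean())
print(final_loss)  # ~0: the noise is memorized, though there was nothing to learn
```

Whether the resulting model generalizes is a separate question, which is exactly the point of the randomization test.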
“Understanding Deep Learning Requires REMEMBERING Generalization”
https://calculatedcontent.com/2018/04/01/rethinking-or-remem...
https://arxiv.org/abs/1710.09553
We can understand this using the traditional theory of the statistical mechanics of generalization.
Briefly, shuffling the labels corresponds to decreasing the effective load on the neural network, which pushes the system into the spin-glass phase.
When you have nothing to learn, you need to memorize the data. But when there is structure, it is easier to memorize the structure, so the network will learn this first (and will memorize after).