undefined | Better HN

0 pointsDoctorOetker7y ago0 comments

that reference you can't find right now seems rather pertinent?

I think the OP intended and should have written:

"It's theoretically impossible to guarantee a convergence to global optima using gradient descent for an arbitrary non-convex function."

For example consider the function f(x)=sin^2(pi * x)+sin^2(pi * N/x) this function has multiple global minima at the divisors of N, where it is f(x)==0, if x or N/x is non-integer, it is guaranteed to be positive...

I am not taking a stance on if gradient descent does or does not guarantee finding global minima and is thus able to factorize cryptographic grade RSA products of primes, but the claim does appear to imply it.

Edit: the multiply symbols changed some cursive

0 comments

LolWolf7y ago

> that reference you can't find right now seems rather pertinent?

Here it is: https://arxiv.org/pdf/1707.08706.pdf (This isn't quite the one I was thinking of, so I'll dig a little deeper, but it covers the idea).

Some slightly more technical conditions have to hold in order to have vanilla GD work (since the function is non-differentiable at points), but a (very!) slightly generalized variant for convex functions—sub-gradient descent—works.

> for an arbitrary non-convex function.

Sure, but this is also obvious since it is NP-hard to reach global optima in arbitrary non-convex problems. Additionally, specifically on the case of GD, I can give simple examples that always fail (consider f(x) = 1 everywhere except f(0) = 0. GD always fails whenever the initial point, x_0 ≠ 0, since the gradient is zero everywhere, except at one point. Additionally, picking initializations randomly, we reach the global minimum with probability 0 whenever we have support with non-empty interior).

I'm afraid I disagree that this is what the OP intended, though, and I also disagree that the paper's claim implies what you've said, since they only study a very specific subproblem (e.g. minimization of empirical loss on ResNets).

The relative "ease" of this task vs solving arbitrary NP-hard problems is not difficult to believe, since, given a bunch of training examples, I can always generate a resnet that fits those examples perfectly (i.e. with zero loss) in poly-space in a very dumb way: first, generate a circuit that matches the look-up table of the training samples (which is poly-space in the number of samples and can be done in poly-time), then map that circuit to an NN.

DoctorOetkerOP7y ago

>Some slightly more technical conditions have to hold in order to have vanilla GD work (since the function is non-differentiable at points)

which function is non-differentiable at points? if you refer to my example, it is only nondifferentiable at x=0 and x=inf which are both uninteresting points since they arent divisors of N, for all the rest f(x) I gave is differentiable and lipschitz continuous of order infinity

This in contrary to your pathological example of f(x)= { 1 (x!=0); 0 (x==0) ... of course GD can not work there, and I wouldn't fault the paper for it...

don't misunderstand me, the paper is interesting, but the title and certain phrasings are very misleading IMHO

Still I think the approach by others is more interesting: by looking at the absolute error between a fixed underlying NN as "ground truth" and observing the error of the training NN (of same architecture as ground truth NN) trained to match the underlying NN

LolWolf7y ago

> which function is non-differentiable at points?

Sorry, this was referring to the construction provided in the paper referenced.

I do agree that the title is somewhat misleading, since, when I first read it (and thought, "this is probably wrong"), I imagined that it proved that given any resnet, you can show convergence to the global optimum via GD, not just "a resent of a given size converges to a global optimum, via GD, for a specific training set."

That being said, the paper does not prove (nor claim to prove) general, globally-optimal convergence of GD, which is what I think you're saying (given, for example, what you mentioned about finding the factorization of a semiprime in the GGP and your specific function construction)—which is what I was pushing back against a bit. In particular, even in the title, they only claim to prove this for a specific class of problems (i.e. NNs).

> Still I think the approach by others is more interesting: by looking at the absolute error between a fixed underlying NN as "ground truth" and observing the error of the training NN (of same architecture as ground truth NN) trained to match the underlying NN

I'm afraid I haven't seen this approach, but it would be interesting. Do you have references?

1 more reply

j / k navigate · click thread line to collapse

0 comments

LolWolf7y ago

> that reference you can't find right now seems rather pertinent?

Here it is: https://arxiv.org/pdf/1707.08706.pdf (This isn't quite the one I was thinking of, so I'll dig a little deeper, but it covers the idea).

> for an arbitrary non-convex function.

DoctorOetkerOP7y ago

>Some slightly more technical conditions have to hold in order to have vanilla GD work (since the function is non-differentiable at points)

This in contrary to your pathological example of f(x)= { 1 (x!=0); 0 (x==0) ... of course GD can not work there, and I wouldn't fault the paper for it...

don't misunderstand me, the paper is interesting, but the title and certain phrasings are very misleading IMHO

LolWolf7y ago

> which function is non-differentiable at points?

Sorry, this was referring to the construction provided in the paper referenced.

I'm afraid I haven't seen this approach, but it would be interesting. Do you have references?

1 more reply

j / k navigate · click thread line to collapse