Neural Network Loss Landscapes: What do we know? (2021) (opens in new tab)

(damueller.com)

152 pointsbitforger3y ago23 comments

23 comments

The "wedge" part under "3. Mode Connectivity" has at least one obvious component: Neural networks tend to be invariant to permuting nodes (together with their connections) within a layer. Simply put, it doesn't matter in what order you number the K nodes of e.g. a fully connected layer, but that alone already means there are K! different solutions with exactly the same behavior. Equivalently, the loss landscape is symmetric to certain permutations of its dimensions.

This means that, at the very least, there are many global optima (well, unless all permutable weights end up with the same value, which is obviously not the case). The fact that different initializations/early training steps can end up in different but equivalent optima follows directly from this symmetry. But whether all their basins are connected, or whether there are just multiple equivalent basins, is much less clear. The "non-linear" connection stuff does seem to imply that they are all in some (high-dimensional, non-linear) valley.

To be clear, this is just me looking at these results from the "permutation" perspective above, because it leads to a few obvious conclusions. But I am not qualified to judge which of these results are more or less profound.

sdenton43y ago

The "functional equivalence" discussed in the past gets at some of this. There's definitely more going on than permutation symmetry; in particular, if two solutions are symmetrically related (and thus evaluating the same overall function) then ensembling them together shouldn't help. But ensembling /does/ still help in many cases.

The different solutions found in different runs likely share a lot of information, but learn some different things on the edges. It would be cool to isolate the difference between two networks...

evolvingstuff3y ago

Completely agree! Plus, less trivially, there can be a bunch of different link weight settings (for an assumed distribution of inputs) that result in nearly-symmetric behaviors, and then that is multiplied by the permutation results you have just mentioned! So, it's complicated...

ced3y ago

But whether all their basins are connected, or whether there are just multiple equivalent basins, is much less clear.

What do you mean by "connected"?

bonoboTP3y ago

That there is a continuous path of low loss between them.

evolvingstuff3y ago

Here are some "animated" loss landscapes I made quite a long time ago:

http://evolvingstuff.blogspot.com/2011/02/animated-fractal-f...

These are related to recurrent neural networks evolved to maximize fitness whilst wandering through a randomly generated maze and picking up food pellets (the advantage being to remember not to revisit where you have already been.)

gildandstain3y ago

These are wonderful, great work.

evolvingstuff3y ago

Thank you!

YeGoblynQueenne3y ago

>> This leads to one of the key questions of deep learning, currently: Why do neural networks prefer solutions that generalize to unseen data, rather than settling on solutions which simply memorize the training data without actually learning anything?

That's the researchers who prefer these solutions, not the networks. And that's how the networks find them: because the experimenters have access to the test data and they keep tuning their networks' parameterers until they perfectly fit not only the training, but also the _test_ data.

In that sense the testing data is not "unseen". The neural net doesn't "see" it during training but the researchers do and they can try to improve the network's performance on it, because they control everything about how the network is trained, when it stops training etc etc.

It's nothing to do with loss functions and the answers are not in the maths. It's good, old researcher bias and it has to be controlled by othear means, namely, rigorous design _and description_ of experiments.

YeGoblynQueenne3y ago

As an addendum, note that training for a competition does not eliminate this overfitting to the test set. Most competitions make the test set instances available, though not their labels. Many restrict the number of submissions that can be made but usually accept several. There's even a bit of jargon regarding how to game this, it's called "hill climbing the test set" (really, it's hill climbing the performance on the test set, i.e. it's the accuracy on the test set that's optimised). Here's an actual how-to:

https://blockgeni.com/how-to-hill-climb-the-test-set-for-mac...

One benchmark I know where the test set is completely hidden is François Chollet's ARC dataset, and that's done precisely to preclude overfitting to the test set.

geysersam3y ago

This explanation is insufficient. Even if it explains good performance on the test data, it does not explain the typically good performance on real world data never seen during the training process.

It appears (somewhat) generalizing models are easier to compute than models that do not generalize at all.

YeGoblynQueenne3y ago

>> Even if it explains good performance on the test data, it does not explain the typically good performance on real world data never seen during the training process.

You'll have to clarify this because I'm not sure what you mean by "real world data". Do you mean e.g. data that is made available after a machine learning system is deployed "live"?

As far as I can tell, nobody really does this kind of "extrinsic" evaluation systematically, first of all because it is very expensive: such "real world data" is unlabelled, and must be labelled before the evaluation.

What's more, the "real world data" is very likely to change between deployments of a machine learning system so any evaluation of a model trained last month may not be valid this month.

So this is all basically very expensive in terms of both money and effort (so, money), and so nobody does it. Instead everyone relies on the approximation of real-world performance on their already labelled datasets.

1 more reply

ogrisel3y ago

There is also this very interesting 2017 paper:

Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

Lei Wu, Zhanxing Zhu, Weinan E

https://arxiv.org/abs/1706.10239

I think it was the first paper to study the volume of the basins of attraction of good global minima and used the poisoning scheme to highlight the frequency of bad global minima that are typically not reachable found via SGD on the original dataset without poisoning.

yobbo3y ago

The loss function is on the parameter space, and "wide basins" having better generalizations is equivalent to saying regularizing (in whatever way) gives better generalization, since regularization constrains the parameter/and or function space in that way.

In small (two or three) dimensions, there are ways of visualizing overtraining/regularization/generalization with scatter plots (maybe coloured with output label) of activations in each layer. Training will form tighter "modes" in the activations, and the "low density" space between modes constitutes "undefined input space" to subsequent layers. Overtraining is when real data falls in these "dead" regions. The aim of regularization is to shape the activation distributions such that unseen data falls somewhere with non-zero density.

Training loss does not give any information on generalization here unless it shows you're in a narrow "well". The loss landscapes are high-dimensional and non-obvious to reason about, even in tiny examples.

HarHarVeryFunny3y ago

This has always been my intuitive explanation as well, and seems related to how a typical over-parameterized net can memorize a randomly labelled training set, yet the same net will be able to generalize (opposite of memorize) if trained on a meaningfully labelled training set (i.e. one where the labels are non-random, and correspond to features in the training data).

With the randomly labelled dataset these activation "modes" are essentially gerrymandered to fit the data since the datapoints have no common features correlated to the labels to cause it to do otherwise.

With the meaningfully labelled dataset, and a smooth loss landscape, multiple datapoints with common features & labels will be pushing these activation modes in the same direction creating "high density modes" within which meaningful generalization occurs.

Generalization, or lack of it, is of course also intimately related to adversarial attacks. It seems that what is going on there is that these high density modes are only disconnected from each other (by areas of low density) when considering the degrees of freedom of data on the training set manifold. In the unconstrained input space off the natural data manifold, these high density areas of different generalization are likely to be connected and it's easy to select an "unnatural feature" that will push a datapoint from mapping to one mode to another.

I've suggested this explanation of generalization a number of times over the years, and always had negative feedback from folk who think there's more to the "generalization mystery" than this.

Kalanos3y ago

This is why model simplicity is so important. When an algorithm has less parameters, it's forced to use those weights to find the most broadly applicable patterns possible, as opposed to noise, in the training data

"Why might SGD prefer basins that are flatter?" It's because they look at the derivative. When the bottom of the valley is flat they don't have enough momentum to get out.

I have observed the lottery ticket hypothesis.

charleshmartin3y ago

https://calculatedcontent.com/2015/03/25/why-does-deep-learn...

Suzuran3y ago

Is it bad if my initial thought upon reading the headline was that someone was training a neural network to recognize the meme?

bilsbie3y ago

I wanted to understand this but I just couldn’t get there.

richardatlarge3y ago

Yes, maybe someone can offer some links for background reading

lysecret3y ago

Do we know things? Let's find out.

j / k navigate · click thread line to collapse

23 comments

MauranKilom3y ago

sdenton43y ago

The different solutions found in different runs likely share a lot of information, but learn some different things on the edges. It would be cool to isolate the difference between two networks...

evolvingstuff3y ago

ced3y ago

But whether all their basins are connected, or whether there are just multiple equivalent basins, is much less clear.

What do you mean by "connected"?

bonoboTP3y ago

That there is a continuous path of low loss between them.

evolvingstuff3y ago

Here are some "animated" loss landscapes I made quite a long time ago:

http://evolvingstuff.blogspot.com/2011/02/animated-fractal-f...

gildandstain3y ago

These are wonderful, great work.

evolvingstuff3y ago

Thank you!

YeGoblynQueenne3y ago

https://blockgeni.com/how-to-hill-climb-the-test-set-for-mac...

One benchmark I know where the test set is completely hidden is François Chollet's ARC dataset, and that's done precisely to preclude overfitting to the test set.

geysersam3y ago

This explanation is insufficient. Even if it explains good performance on the test data, it does not explain the typically good performance on real world data never seen during the training process.

It appears (somewhat) generalizing models are easier to compute than models that do not generalize at all.

YeGoblynQueenne3y ago

>> Even if it explains good performance on the test data, it does not explain the typically good performance on real world data never seen during the training process.

You'll have to clarify this because I'm not sure what you mean by "real world data". Do you mean e.g. data that is made available after a machine learning system is deployed "live"?

What's more, the "real world data" is very likely to change between deployments of a machine learning system so any evaluation of a model trained last month may not be valid this month.

1 more reply

ogrisel3y ago

There is also this very interesting 2017 paper:

Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

Lei Wu, Zhanxing Zhu, Weinan E

https://arxiv.org/abs/1706.10239

yobbo3y ago

HarHarVeryFunny3y ago

I've suggested this explanation of generalization a number of times over the years, and always had negative feedback from folk who think there's more to the "generalization mystery" than this.

Kalanos3y ago

"Why might SGD prefer basins that are flatter?" It's because they look at the derivative. When the bottom of the valley is flat they don't have enough momentum to get out.

I have observed the lottery ticket hypothesis.

charleshmartin3y ago

https://calculatedcontent.com/2015/03/25/why-does-deep-learn...

Suzuran3y ago

Is it bad if my initial thought upon reading the headline was that someone was training a neural network to recognize the meme?

bilsbie3y ago

I wanted to understand this but I just couldn’t get there.

richardatlarge3y ago

Yes, maybe someone can offer some links for background reading

lysecret3y ago

Do we know things? Let's find out.

j / k navigate · click thread line to collapse