If your layer size is relatively small (not hundreds or thousands of nodes), dropout is usually detrimental and a more traditional regularization method such as weight-decay is superior.
For networks of the size Hinton et al. are playing with nowadays (with thousands of nodes in a layer), dropout is good, though.
In huge networks with a lot of non-independent feature detectors, the network can tolerate having ~50% of them dropped out and then improves when you use them all at once. But in small networks with mostly independent features (at least in some layer), dropout can cause the feature detectors to thrash and fail to stabilize properly.
Consider a 32-16-10 feedforward network with binary stochastic units. If all 10 output bits are independent of each other, and you apply dropout to the hidden layer, the expected number of surviving hidden units is 8, so you lose information (since the output bits are independent of each other) without any hope of getting it back.
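To make that arithmetic concrete, here's a quick simulation of dropout masks on the 16-unit hidden layer (illustrative only; the width and the 0.5 drop rate come from the example above):

```python
import random

random.seed(0)

HIDDEN = 16    # hidden-layer width from the 32-16-10 example
P_DROP = 0.5   # standard dropout rate

# Draw many dropout masks and count how many hidden units survive each time.
trials = 10_000
survivors = [
    sum(1 for _ in range(HIDDEN) if random.random() >= P_DROP)
    for _ in range(trials)
]
mean_alive = sum(survivors) / trials

print(f"expected units alive: {mean_alive:.2f}")  # close to 8 of 16
```

So on a typical step only ~8 hidden units carry the signal, which is the information-loss argument above.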
I think you are misinterpreting what he is saying about grid search. The grid search is just to narrow the field of parameters initially; he doesn't say how he would proceed after that point.
Just curious, what do you consider the state of the art? Bayesian optimization? Wouldn't a grid search to start be like a uniform prior?
The rest of his suggestions looked on point to me. Did you see anything else you would differ with? (I ask sincerely for my own education.)
There's also an argument (http://jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) that random search is better than grid search, because when only a few of the parameters really matter (but you don't know which ones), grid search wastes effort on scanning the unimportant parameters with the important parameters held fixed, but each point in a random search evaluates a new setting of the important parameters.
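That argument is easy to see in a few lines. With the same trial budget, a 3×3 grid only ever tries 3 distinct values of each hyperparameter, while random search tries a fresh value on every trial (the parameter names and ranges below are made up for illustration):

```python
import random

random.seed(1)

N_TRIALS = 9

# Grid search: 3 x 3 grid over two hyperparameters.
# Only 3 distinct values of each parameter are ever evaluated.
grid = [(lr, mom) for lr in (0.001, 0.01, 0.1) for mom in (0.0, 0.5, 0.9)]

# Random search: the same 9 trials, but each one draws fresh values,
# so the important parameter (whichever it is) gets 9 distinct settings.
rand = [(10 ** random.uniform(-3, -1), random.uniform(0.0, 0.9))
        for _ in range(N_TRIALS)]

print(len({lr for lr, _ in grid}))  # 3 distinct learning rates tried
print(len({lr for lr, _ in rand}))  # 9 distinct learning rates tried
```

If only the learning rate matters, the grid wasted two thirds of its budget re-testing the same three learning rates at different momenta.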
All this said, certainly grid search is way better than not optimizing parameters at all. My guess is that was the spirit in which this suggestion was made, so I wouldn't take it as a reason to discount the guy.
Grid search is nothing like a uniform prior since you would never get a grid search-like set of test points in a sample from a uniform prior.
I didn't really want to write a list of criticism for what is presumably a smart and earnest gentleman and the similarly smart and earnest woman who summarized the tips from his talk, but here goes:
The H2O architecture looks like a great way to get a marginal benefit from lots of computers and is not something that actually solves the parallelization problem well at all.
Using reconstruction error of an autoencoder for anomaly detection is wrong and dangerous so it is a bad example to use in a talk.
Adadelta isn't necessary, and great results can be obtained with much simpler techniques. It is a perfectly good thing to use, but it isn't a great tip in my mind, and not something I would put on a list of tips.
In general, the list of tips just doesn't seem very helpful.
https://news.ycombinator.com/item?id=7803101
I will also add that looking into Hessian-free training, over conjugate gradient/LBFGS/SGD, for feedforward nets has proven to be amazing [1].
Recursive nets I'm still playing with, but based on the work by Socher [2], LBFGS worked just fine for them.
[1]: http://www.cs.toronto.edu/~rkiros/papers/shf13.pdf
[2]: http://socher.org/