If your layer size is relatively small (not hundreds or thousands of nodes), dropout is usually detrimental and a more traditional regularization method such as weight-decay is superior.
For networks of the size Hinton et al. are playing with nowadays (with thousands of nodes in a layer), dropout is good, though.
In huge networks with a lot of non-independent feature detectors, the network can tolerate having ~50% of them dropped out and then improves when you use them all at once. But in small networks with mostly independent features (at least in some layer), dropout can cause the feature detectors to thrash and fail to stabilize properly.
Consider a 32-16-10 feedforward network with binary stochastic units. If all 10 output bits are independent of each other, and you apply dropout to the hidden layer, the expected number of surviving hidden units is 8, so you lose information (since the output bits are independent of each other) without any hope of getting it back.
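To make that arithmetic concrete, here's a quick simulation of dropout masks on the 16-unit hidden layer (illustrative only; the width and the 0.5 drop rate come from the example above):

```python
import random

random.seed(0)

HIDDEN = 16    # hidden-layer width from the 32-16-10 example
P_DROP = 0.5   # standard dropout rate

# Draw many dropout masks and count how many hidden units survive each time.
trials = 10_000
survivors = [
    sum(1 for _ in range(HIDDEN) if random.random() >= P_DROP)
    for _ in range(trials)
]
mean_alive = sum(survivors) / trials

print(f"expected units alive: {mean_alive:.2f}")  # close to 8 of 16
```

So on a typical step only ~8 hidden units carry the signal, which is the information-loss argument above.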
I think you are misinterpreting what he is saying about grid search. The grid search is just to narrow the field of parameters initially; he doesn't say how he would proceed after that point.
Just curious, what do you consider the state of the art? Bayesian optimization? Wouldn't a grid search to start be like a uniform prior?
The rest of his suggestions looked on point to me. Did you see anything else you would differ with? (I ask sincerely for my own education.)
There's also an argument (http://jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) that random search is better than grid search, because when only a few of the parameters really matter (but you don't know which ones), grid search wastes effort on scanning the unimportant parameters with the important parameters held fixed, but each point in a random search evaluates a new setting of the important parameters.
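That argument is easy to see in a few lines. With the same trial budget, a 3×3 grid only ever tries 3 distinct values of each hyperparameter, while random search tries a fresh value on every trial (the parameter names and ranges below are made up for illustration):

```python
import random

random.seed(1)

N_TRIALS = 9

# Grid search: 3 x 3 grid over two hyperparameters.
# Only 3 distinct values of each parameter are ever evaluated.
grid = [(lr, mom) for lr in (0.001, 0.01, 0.1) for mom in (0.0, 0.5, 0.9)]

# Random search: the same 9 trials, but each one draws fresh values,
# so the important parameter (whichever it is) gets 9 distinct settings.
rand = [(10 ** random.uniform(-3, -1), random.uniform(0.0, 0.9))
        for _ in range(N_TRIALS)]

print(len({lr for lr, _ in grid}))  # 3 distinct learning rates tried
print(len({lr for lr, _ in rand}))  # 9 distinct learning rates tried
```

If only the learning rate matters, the grid wasted two thirds of its budget re-testing the same three learning rates at different momenta.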
All this said, certainly grid search is way better than not optimizing parameters at all. My guess is that was the spirit in which this suggestion was made, so I wouldn't take it as a reason to discount the guy.
Grid search is nothing like a uniform prior since you would never get a grid search-like set of test points in a sample from a uniform prior.
I didn't really want to write a list of criticism for what is presumably a smart and earnest gentleman and the similarly smart and earnest woman who summarized the tips from his talk, but here goes:
The H2O architecture looks like a great way to get a marginal benefit from lots of computers and is not something that actually solves the parallelization problem well at all.
Using reconstruction error of an autoencoder for anomaly detection is wrong and dangerous so it is a bad example to use in a talk.
Adadelta isn't necessary, and great results can be obtained with much simpler techniques. It is a perfectly good thing to use, but it isn't a great tip in my mind, and not something I would put on a list of tips.
In general, the list of tips just doesn't seem very helpful.
https://news.ycombinator.com/item?id=7803101
I will also add that looking into Hessian-free training, over conjugate gradient/LBFGS/SGD, for feedforward nets has proven to be amazing [1].
Recursive nets I'm still playing with, but based on the work by Socher [2], LBFGS worked just fine for them.
[1]: http://www.cs.toronto.edu/~rkiros/papers/shf13.pdf
[2]: http://socher.org/