And, in particular, how should one interpret, through the "view it as a kernel machine / interpolation" lens, the fact that different hyperparameters determined whether runs that reached equally high accuracy on the training data got good or bad scores on the test data?
My understanding is that at least one of those "models learned by gradient descent are equivalent to [some other model]" papers works by constructing something based on the entire training history of the network. Is that the kernel machines one, or some other one?
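(For what it's worth, the construction I have in mind looks roughly like a "path kernel" that accumulates gradient inner products along the whole optimization trajectory; my notation here is just a sketch of how I remember it, not a quote from the paper:

$$K(x, x') \;=\; \int_{c(t)} \nabla_w f_w(x) \cdot \nabla_w f_w(x') \, dt,$$

where $f_w$ is the network, and $c(t)$ is the path the weights trace out during training. So the resulting "equivalent model" depends on the entire training run, not just the final weights. But I may be misremembering which paper defines it this way.)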