
tbalsam 2y ago

Part of the issue here is posting a LessWrong post. There is some good in there, but much of that site is like a Flat Earth conspiracy theory for neural networks.

Neural network training [edit: on a fixed point task, as is often the case {such as image->label}] is always (always) necessarily biphasic, so there is no "eventual recovery from overfitting". In my experience, it is just people who are newer to the field, or just noodling around, fundamentally misunderstanding what is happening as their network goes through a very delayed phase change. Unfortunately, these kinds of posts get significantly amplified, as people like chasing the new shiny of some fad-or-another-that-does-not-actually-exist instead of the much more 'boring' (which I find fascinating) math underneath it all.

To me, as someone who specializes in optimizing network training speeds, it just indicates poor engineering of the problem on the part of the person running the experiments. It is not a new or strange phenomenon; it is a literal consequence of the information theory underlying neural network training.


PoignardAzur 2y ago

> Part of the issue here is posting a LessWrong post

I mean, this whole line of analysis comes from the LessWrong community. You may disagree with them on whether AI is an existential threat, but the fact that people take that threat seriously is what gave us this whole "memorize-or-generalize" analysis, and glitch tokens before that, and RLHF before that.

tbalsam (OP) 2y ago

I think you may be missing the extensive lines of research covering those topics. Memorization vs Generalization has been a debate since before LW even existed in the public eye, and inputs that networks are unusually sensitive to have been well studied as well (re: chaotic vs linear regimes in neural networks). Especially the memorization vs generalization bit -- that has been around for...decades. It's considered a fundamental part of the field, and has had a ton of research dedicated to it.

I don't know much either way about RLHF in terms of its direct lineage, but I highly doubt that is actually what happened, since DeepMind is actually responsible for the bulk of the historical research supporting those methods.

It's possible, à la the broken clock hypothesis, and LessWrong is obviously not the "primate at a typewriter" situation, so there's a chance of some people making meaningful contributions, but the signal-to-noise ratio is awful. I want to get something out of some of the posts I've tried to read there, but there are so many bad takes written in bombastic language that it's really quite hard.

Right now, it's an active detriment to the field because it pulls attention away from things that are much more deserving of energy and time. I honestly wish the vibe was back to people even just making variations of Char-RNN repos based on Karpathy's blog posts. That was a much more innocent time.

PoignardAzur 2y ago

> I think you may be missing the extensive lines of research covering those topics. Memorization vs Generalization

I meant this specific analysis, that neural networks that are over-parameterized will at first memorize but, if they keep training on the same dataset with weight decay, will eventually generalize.

Then again, maybe there have been analyses done on this subject I wasn't aware of.
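For readers who want to see the weight-decay ingredient in isolation, here is a minimal sketch. It is not a reproduction of grokking and not code from any paper: it swaps the neural network for an over-parameterized linear model trained with plain gradient descent, and every name and hyperparameter (`train`, `lr`, `weight_decay`, the 5x50 data shape) is an assumption made for illustration.

```python
import numpy as np

def train(X, y, w0, lr=0.01, weight_decay=0.0, steps=5000):
    """Plain gradient descent on squared error with an optional L2 penalty."""
    w = w0.copy()
    n = len(y)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n          # gradient of 0.5 * mean squared error
        w -= lr * (grad + weight_decay * w)   # weight decay pulls w toward zero
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50))   # 5 examples, 50 parameters: over-parameterized
y = rng.normal(size=5)         # random targets, so fitting them is pure memorization
w0 = rng.normal(size=50)       # random initialization

w_plain = train(X, y, w0)                    # no weight decay
w_decay = train(X, y, w0, weight_decay=0.1)  # same run with weight decay

train_mse_plain = float(np.mean((X @ w_plain - y) ** 2))
norm_plain = float(np.linalg.norm(w_plain))
norm_decay = float(np.linalg.norm(w_decay))
print(train_mse_plain, norm_plain, norm_decay)
```

Without weight decay, the model fits the training set exactly but keeps whatever part of the random initialization the data never constrains. With weight decay, that unconstrained part shrinks over many steps, so the solution ends up with a much smaller norm (at the cost of a small, nonzero training error). Whether and when that pressure toward a simpler solution shows up as late generalization in a real network is exactly what is being debated in this thread.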


woopwoop 2y ago

I don't think that is true? As far as I know the grokking phenomenon was first observed (and the name coined) in this paper, not in any blog post:

https://arxiv.org/abs/2201.02177

tbalsam (OP) 2y ago

That's true, and I probably should have done some better backing up, sorting out, and clarification. I remember when that paper came out, it rubbed me the wrong way too then, because it is people rediscovering double descent from a different perspective, and not recognizing it as such.

It would be better defined as "a sudden change in phase state after a long period of metastability". Even then, that ignores that those sharp inflections indicate a poor KL divergence between some of the inductive priors and the data at hand.

You can think about it as the loss signal from the support of two gaussians extremely far apart with narrow standard deviations. Sure, they technically have support, but in a noisy regime you're going to have nothing.... nothing.... nothing....and then suddenly something as you hit that point of support.
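A small numeric sketch of that picture, assuming 1-D Gaussians and illustrative values (a standard deviation of 0.1 and a separation of 10, neither of which comes from the comment): the far-away density is technically positive but underflows to exactly zero in float64, while the KL divergence between the two distributions is enormous.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log density of a 1-D Gaussian, computed in log space to avoid underflow."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))

def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """Closed-form KL(p || q) for two 1-D Gaussians."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
            - 0.5)

std = 0.1  # narrow standard deviation (illustrative value)

# Density of N(0, std^2) evaluated close to its mean and very far from it.
near = math.exp(gaussian_log_pdf(0.05, 0.0, std))
far = math.exp(gaussian_log_pdf(10.0, 0.0, std))   # underflows to 0.0 in float64

# KL divergence between two equally narrow Gaussians whose means are 10 apart.
kl_far = gaussian_kl(0.0, std, 10.0, std)

print(near, far, kl_far)
```

In log space the far-away value is still finite (roughly -5000), which is the "technically has support" part; any noise in the signal swamps it until the two distributions are brought much closer together.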

Little of the literature, definitions around the word, or anything like that really takes this into account generally, leading to this mass illusion that this is not a double descent phenomenon, when in fact it is.

Hopefully this is a more appropriate elaboration. I appreciate your comment pointing out my mistake.


ShamelessC 2y ago

> Part of the issue here is posting a LessWrong post. There is some good in there, but much of that site is like a Flat Earth conspiracy theory for neural networks.

Indeed! It’s very frustrating that so many people here are such staunch defenders of LessWrong. Some/much of the behavior there is honestly concerning.

tbalsam (OP) 2y ago

100% agreed. I'm pretty sure today was the first time I learned that the site was founded by Yudkowsky, which honestly explains quite a bit (polite 'lol' added here for lightheartedness)

tbalsam (OP) 2y ago

To further clarify things, the reason there is no mystical 'eventual recovery from overfitting' is that overfitting is a stable bound that is approached. Applying this false label implies a non-biphasic nature to neural network training, and adds false information that wasn't there before.

Thankfully things are pretty stable in the over/underfitting regime. I feel sad when I see ML misinformation propagated on a forum that requires little experience but has high leverage due to the rampant misuse of existing terms and complete invention of an in-group language that has little contact with the mathematical foundations of what's happening behind the scenes. I've done this for 7-8 years at this point at a pretty deep level and have a strong pocket of expertise, so I'm not swinging at this one blindly.

Noumenon72 2y ago

What are the two phases? What determines when you switch?

tbalsam (OP) 2y ago

Memorization of individual examples -> generalization. I can't speak about what determines the switch, as that is (partially, to some degree) something I'm working on, and I have a personal rule not to share work in progress until it's completed (and then be very open and explicit about it). My apologies on that front.

However, I can point you to one comment I made earlier in this particular comment section about the MDL and how that relates to the L2 norm. Obviously this is not the only thing that induces a phase change, but it is one of the more blatant ones that's been covered a little more publicly by different people.
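The earlier comment is not part of this thread excerpt, so for context here is the textbook link between description length and the L2 norm. It is the standard Gaussian-prior argument, not necessarily the formulation that comment used, and the symbols $w$ (weights), $D$ (data), $\sigma$ (prior scale), and $\lambda$ (penalty strength) are introduced here purely for illustration.

```latex
\begin{aligned}
L(D, w) &= -\log p(D \mid w) - \log p(w), \qquad p(w) = \mathcal{N}(0, \sigma^2 I) \\
-\log p(w) &= \frac{\lVert w \rVert_2^2}{2\sigma^2} + \text{const} \\
L(D, w) &= -\log p(D \mid w) + \lambda \lVert w \rVert_2^2 + \text{const}, \qquad \lambda = \frac{1}{2\sigma^2}
\end{aligned}
```

Read this way, the total code length is the cost of encoding the data given the weights plus the cost of encoding the weights themselves, and an L2 penalty (weight decay) is the second term under a zero-mean Gaussian prior.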
