The heart of the bitter lesson is "don't try to codify 'insight' into the process." It's basically the age-old "you don't know what you don't know."
The Transformer is kind of a perfect example. It boasts algorithmic improvements over RNNs, and LLMs built on it are by far the best-performing take on language modelling ever. And yet the architecture itself contains basically no breakthrough in our understanding of language. It improves on standard RNNs, but not because of any newfound insight into language itself.
Basically, trying to cram high-level human instincts and insights into the process of solving a problem doesn't work better than giving a general architecture tons of data and letting it figure all of that out by itself.