YeGoblynQueenne2y ago

That's the "bitter lesson", right? Which is really a sour lesson, as in sour grapes. See, Rich Sutton's point with his Bitter Lesson is that encoding expert knowledge only improves performance temporarily, and that improvement is eventually surpassed by more data and compute.

There are only two problems with this: One, statistical machine learning systems have an extremely limited ability to encode expert knowledge. The language of continuous functions is alien to most humans and it's very difficult to encode one's intuitive, common sense knowledge into a system using that language [1]. That's what I mean when I say "sour grapes". Statistical machine learning folks can't use expert knowledge very well, so they pretend it's not needed.

Two, all the loud successes of statistical machine learning in the last couple of decades are closely tied to minutely specialised neural net architectures: CNNs for image classification, LSTMs for translation, Transformers for language, Diffusion models and GANs for image generation. If that's not encoding knowledge of a domain, what is?

Three, because of course three, despite point number two, performance keeps increasing only as data and compute increase. That's because the minutely specialised architectures in point number two are inefficient as all hell; the result of not having a good way to encode expert knowledge. Statistical machine learning folk make a virtue out of necessity and pretend that only being able to increase performance by increasing resources is some kind of achievement, whereas it's exactly the opposite: it is a clear demonstration that the capabilities of systems are not improving [2]. If capabilities were improving, we should see the number of examples required to train a state-of-the-art system either staying the same, or going down. Well, it ain't.

Of course the neural net [community] will complain that their systems have reached heights never before seen in classical AI, but that's an argument that can only be sustained by the ignorance of the continued progress in all the classical AI subjects such as planning and scheduling, SAT solving, verification, automated theorem proving and so on.

For example, and since planning is high on my priorities these days, see this video where the latest achievements in planning are discussed (from 2017).

https://youtu.be/g3lc8BxTPiU?si=LjoFITSI5sfRFjZI

See particularly around this point where he starts talking about the Rollout IW(1) symbolic planning algorithm that plays Atari from screen pixels with performance comparable to Deep-RL; except it does so online (i.e. no training, just reasoning on the fly):

https://youtu.be/g3lc8BxTPiU?si=33XSM6yK9hOlZJnf&t=1387
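
For readers who have not met the IW family, here is a minimal sketch of plain IW(1): breadth-first search that keeps a newly generated state only if it makes at least one atom true for the first time in the search. This is not the Rollout variant from the talk, and the toy grid, the `(name, value)` atom encoding, and every function name below are assumptions made purely for illustration. There is no training step; the planner just searches from the current state.

```python
from collections import deque

def iw1(start, successors, atoms, is_goal):
    """Breadth-first search with IW(1) novelty pruning.

    A generated state is kept only if it contains at least one atom
    that no previously generated state contained.
    """
    if is_goal(start):
        return []
    seen_atoms = set(atoms(start))
    queue = deque([(start, [])])
    while queue:
        state, plan = queue.popleft()
        for action, child in successors(state):
            new_atoms = set(atoms(child)) - seen_atoms
            if not new_atoms:
                continue  # nothing novel in this state: prune it
            seen_atoms |= new_atoms
            if is_goal(child):
                return plan + [action]
            queue.append((child, plan + [action]))
    return None  # goal not reachable within width 1

# Hypothetical toy domain: an agent on a 4x4 grid.
SIZE = 4
MOVES = {"right": (1, 0), "left": (-1, 0), "up": (0, 1), "down": (0, -1)}

def grid_successors(state):
    x, y = state
    for name, (dx, dy) in MOVES.items():
        nx, ny = x + dx, y + dy
        if 0 <= nx < SIZE and 0 <= ny < SIZE:
            yield name, (nx, ny)

def grid_atoms(state):
    # Each state is described by two atoms: its x value and its y value.
    x, y = state
    return {("x", x), ("y", y)}

# A goal that is a single atom (x == 3) is found by IW(1).
plan_single_atom = iw1((0, 0), grid_successors, grid_atoms, lambda s: s[0] == 3)

# A goal that needs two atoms at once (x == 3 and y == 2) is missed,
# because states like (1, 1) contain no new atom and are pruned.
plan_conjunction = iw1((0, 0), grid_successors, grid_atoms, lambda s: s == (3, 2))
```

The second call returning `None` shows the trade-off: the pruning keeps the search very cheap but makes it incomplete, so how well it works depends on how the state is broken into atoms.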

Bitter lesson my sweet little ass.

____________

[1] Gotta find where this paper was but none other than Vladimir Vapnik basically demonstrated this by trying the maddest experiment I've ever seen in machine learning: using poetry to improve a vision classifier. It didn't work. He's spent the last 20 years trying to find a good way to encode human knowledge into continuous functions. It doesn't work.

[2] In particular their capability for inductive generalisation which remains absolutely crap.

gwern2y ago

Vapnik: https://www.cs.princeton.edu/courses/archive/spring13/cos511... https://engineering.columbia.edu/files/engineering/vapnik.pd... https://www.learningtheory.org/learning-has-just-started-an-... https://nautil.us/teaching-me-softly-234576/

The main paper: https://gwern.net/doc/reinforcement-learning/exploration/act...

It sounds kinda crazy (is there really that much far transfer?), but you know, I think it would work... He just needed to use LLMs instead: https://arxiv.org/abs/2309.10668#deepmind

YeGoblynQueenneOP2y ago

Yeah, that's one of the papers in that line of research by Vapnik. He's got a few with similar content. Visually, it's not the paper I remember, I'll have to read it again to be sure.

If I remember correctly, Vapnik's point is, we know that Big Data Deep Learning works; now, try to do the same thing with small data. Very much like my point that capabilities of models are not improving, only the scale increasing.

adw2y ago

> The language of continuous functions is alien to most humans and it's very difficult to encode one's intuitive, common sense knowledge into a system using that language

In other words; machine learned models are octopus brains (https://www.scientificamerican.com/article/the-mind-of-an-oc...) and that creeps you out. Fair enough, it creeps me out too, and we should honour our emotions — I'm no rationalist – but we should also be aware of the risks of confusing our emotional responses with reality.

YeGoblynQueenneOP2y ago

Please don't god mode me? Machine learning doesn't creep me out. I'm sorry it creeps you out. In my culture, octopus is a prized delicacy, my dad used to fish them out of the sea with his bare hands when I was a kid. If you wanna creep me out, you should try snake, not octopus.

famouswaffles2y ago

>Two, all the loud successes of statistical machine learning in the last couple of decades are closely tied to minutely specialised neural net architectures: CNNs for image classification, LSTMs for translation, Transformers for vision, Diffusion models and GANs for image generation. If that's not encoding knowledge of a domain, what is?

Transformers, Diffusion for Vision, Image generation are really odd examples here. None of those architectures or training processes were designed with vision in mind lol. It was, what, 3 years after the 2017 attention paper before the famous ViT paper? CNNs have lost a lot of favor to ViTs, and LSTMs are not the best-performing translators today.

The bitter lesson is that less encoding of "expert" knowledge results in better performance and this has absolutely held up. The "encoding of knowledge" you call these architectures is nowhere near that of the GOFAI kind and even more than that, less biased NN architectures seem to be winning out.

>That's because the minutely specialised architectures in point number two are inefficient as all hell; the result of not having a good way to encode expert knowledge.

Inefficient is a whole lot better than can't even play the game, the story of GOFAI for the last few decades.

>If capabilities were improving, we should see the number of examples required to train a state-of-the-art system either staying the same, or going down. Well, they ain't.

The capabilities of models are certainly increasing. Even your example is blatantly wrong. Do you realize how much more data and compute it would take to train a vanilla RNN to, say, GPT-3-level performance?

YeGoblynQueenneOP2y ago

>> Inefficient is a whole lot better than can't even play the game, the story of GOFAI for the last few decades.

See e.g. my link above where GOFAI plays the game (Atari) very well indeed.

Also see Watson winning Jeopardy (a hybrid system, but mainly GOFAI - using frames and Prolog for knowledge extraction, encoding and retrieval).

And Deep Blue beating Kasparov. And MCTS still the SOTA search algo in Go etc.

And EURISKO playing Traveller as above.

And Pluribus playing Poker with expert game-playing knowledge.

And the recent neuro-symbolic DeepMind thingy that solves geometry problems from the maths olympiad.

etc. etc. [Gonna stop editing and adding more as they come to my mind here.]

And that's just playing games. As I say in my comment above, planning and scheduling, SAT, constraints, verification, theorem proving: those are still dominated by classical systems and neural nets suck at them. Ask Yann LeCun: "Machine learning sucks". He means it sucks in all the things that classical AI does best and he means he wants to do them with neural nets, and of course he'll fail.

adw2y ago

> And MCTS still the SOTA search algo in Go etc

It's often forgotten that Rich Sutton said the two things which work are learning (the AlphaGo/Leela Zero policy network) and search (MCTS). (I think the most interesting research in ML is around the circumstances in which large models wind up performing implicit search.)

YeGoblynQueenneOP2y ago

Well, gradient optimisation is a form of search.

famouswaffles2y ago

That was a figure of speech. I didn't literally mean games (not that GOFAI performs better than NNs in those games anyway). I simply went off your own examples - Vision, Image generation, Translation etc.

>As I say in my comment above planning and scheduling, SAT, constraints, verification, theorem proving- those are still dominated by classical systems

You can use NNs for all these things. It wouldn't make a lot of sense, because GOFAI would be perfect and NNs would be inefficient, but you certainly could, which is again more than I can say for GOFAI and the domains you listed.

YeGoblynQueenneOP2y ago

I don't understand your comment. Clarify.

As it is, your comment seems to tell me that neural nets are good at neural net things and GOFAI is good at GOFAI things, which is obvious, and is what I'm saying: neural nets can make only very limited use of expert knowledge and so suck in all domains where domain knowledge is abundant and abundantly useful, which are the same domains where GOFAI dominates. GOFAI can make very good use of expert knowledge but is traditionally not as good in domains where only tacit knowledge is available, because we don't understand the domain well enough yet, like in anything to do with pattern recognition, which are the same domains where neural nets dominate. If explicit, expert knowledge were available for those domains, then GOFAI would dominate, and neural nets would fall behind, completely contrary to what Sutton thinks.

So, the bitter lesson is only bitter for those who are not interested in what classical AI systems can do best. For those of us who are, the lesson is sweet indeed: we're making progress, algorithmic progress, progress in understanding, scientific progress, and don't need to burn through thousands of credits to train on server farms to do anything of note. That's even a running joke in my team: hey, do you need any server time? Nah, I'll run the experiment on my laptop over lunch. And then beat the RL algo (PPO) that needs three days training on GPUs. To solve mazes badly.

YeGoblynQueenneOP2y ago

Addendum:

>> Do you realize how much more data and compute it would take to train a Vanilla RNN to say GPT-3 level performance?

Oh, good point. And what would GPT-3 do with the typical amount of data used to train an LSTM? Rhetorical.

adw2y ago

Yeah, all of those architectures are _themselves_ hacks to get around having insufficient compute! They absolutely were encoding inductive biases into the network to get around not being able to train enough, and transformers (handwaving hard enough to levitate, the currently-trainable model family with the least inductive bias) have eaten the world in all domains.

This is evidence _for_ the Bitter Lesson, not against it.

YeGoblynQueenneOP2y ago

They haven't (eaten the world etc). They just happen to be the models that trend hard right now. I bet if you could compare like for like you'd be able to see some improvement in performance from Transformers, but that'd be extremely hard to separate from the expected improvement from the constantly increasing amounts of data and compute. For example, you could, today, train a much bigger and deeper Multi-Layered Perceptron than you could thirty years ago, but nobody is trying because that's so 1990s, and in any case they have the data and compute to train much bigger, much more inefficient (contrary to what you say if I got that right) architectures.

Wait a few years and the Next Big Thing in AI will come along, hot on the heels of the next generation of GPUs, or tensor units or whatever the hardware industry can cook up to sell shovels for the gold rush. By then, Transformers will have hit the plateau of diminishing returns, there'll be gold in them there other hills and nobody will talk of LLMs anymore because that's so 2020s. We've been there so many times before.

adw2y ago

> much more inefficient

The tricky part here is that "efficiency" is not a single dimension! Transformers are much more "efficient" in one sense, in that they appear to be able to absorb much more data before they saturate; they're in general less computationally efficient in that you can't exploit symmetries as hard, for example, at implementation time.

Let's talk about that in terms of a concrete example: the big inductive bias of CNNs for vision problems is that CNNs essentially presuppose that the model should be translation-invariant. This works great — speeds up training and makes it more stable – until it doesn't and that inductive bias starts limiting your performance, which is in the large-data limit.
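
To make that bias concrete, here is a small NumPy sketch under some simplifying assumptions: a 1D signal, circular (wrap-around) boundaries, and random weights instead of a trained model. Strictly speaking, the convolution itself is translation-equivariant (shift the input and the output shifts with it), and invariance only appears once you pool over positions. A fully-connected layer with arbitrary weights has neither property. All names below are illustrative.

```python
import numpy as np

def circular_conv1d(signal, kernel):
    """Cross-correlate a 1D signal with a kernel, wrapping at the edges."""
    n = len(signal)
    out = np.zeros(n)
    for i in range(n):
        for j in range(len(kernel)):
            out[i] += kernel[j] * signal[(i + j) % n]
    return out

rng = np.random.default_rng(0)
signal = rng.normal(size=16)
kernel = rng.normal(size=3)          # shared weights: the CNN inductive bias
dense = rng.normal(size=(16, 16))    # one free weight per input/output pair
shift = 5

# Equivariance: convolving a shifted input equals shifting the convolved output.
conv_commutes = np.allclose(
    circular_conv1d(np.roll(signal, shift), kernel),
    np.roll(circular_conv1d(signal, kernel), shift),
)

# Invariance after global max pooling: the pooled value ignores the shift.
pooled_invariant = np.isclose(
    circular_conv1d(signal, kernel).max(),
    circular_conv1d(np.roll(signal, shift), kernel).max(),
)

# A generic dense layer does not commute with shifts.
dense_commutes = np.allclose(
    dense @ np.roll(signal, shift),
    np.roll(dense @ signal, shift),
)

print(conv_commutes, pooled_invariant, dense_commutes)
```

The convolution gets this behaviour for free from 3 shared weights, whereas the dense layer would have to learn it across 256 independent ones, which is the sense in which the bias helps when data is scarce and constrains the model when it is plentiful.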

Fully-connected NNs are more general than transformers, but they have _so many_ degrees of freedom that the numerical optimization problem is impractical. If someone figures out how to stabilize that training and make these implementable on current or future hardware, you're absolutely right that you'll see people use them. I don't think transformers are magic; you're entirely correct in saying that they're the current knee on the implementability/trainability curve, and that can easily shift given different unit economics.

I think one of the fundamental disconnects here is that people who come at AI from the perspective of logic down think of things very differently to people like me who come at it from thermodynamics _up_.

Modern machine learning is just "applications of maximum entropy", and to someone with a thermodynamics background, that's intuitively obvious (not necessarily correct! just obvious): in a meaningful sense the _universe_ is a process of gradient descent, so "of course" the answer for some local domain models is maximum-entropy too. In that world view, the higher-order structure is _entirely emergent_. I'm, by training, a crystallographer, so the idea that you can get highly regular structure emerging from merciless application of a single principle is just baked into my worldview very deeply.

Someone who comes at things from the perspective of mathematical logic is going to find that worldview very weird, I suspect.
