undefined | Better HN

0 pointslittlestymaar1y ago0 comments

Attention heads existed before Transformers, they where used in recurrent neural networks (RNN) to improve their performance. The paper is called “Attention is all you need” because transformers keep the attention head while discarding the RNN part entirely.

Getting rid of RNN vastly improved training scalability and allowed big players to start training enormous models on even more enormous training set in ways that weren't possible with a RNN AFAIK.

0 comments

quantadev1y ago

When discussing "Attention Heads" in the context of the Transformers Paper, there's no need to put the word "Self" in front of it, as in "Self-Attention". That's the context in which I used the word Attention above. Something similar to self-attention had pre-existed this paper, but not actual self-attention.

You're right that getting rid of "Recurrence" was another innovation, but removing it was probably more of a hack to make things parallelizable, than something that was architecturally justifiable from first principles (like self-attention is), because there's definite "power" in Recurrence (making it desirable), but it's just too costly to run that in LLMs because of CPU cycles.

littlestymaarOP1y ago

> removing it was probably more of a hack to make things parallelizable

But that's the entire point of it. Transformer-based LLM are “more intelligent” just because you can make them bigger and train them on bigger datasets because of this parallelization.

quantadev1y ago

It's not just about size. Self-Attention is every bit as important as large size, because if we had the current large size, but without Self-Attention we wouldn't have the emergent intelligence. Also "size" isn't even a new innovation. Self-Attention was a new innovation.

1 more reply

j / k navigate · click thread line to collapse

0 pointslittlestymaar1y ago0 comments

Getting rid of RNN vastly improved training scalability and allowed big players to start training enormous models on even more enormous training set in ways that weren't possible with a RNN AFAIK.

0 comments

quantadev1y ago

littlestymaarOP1y ago

> removing it was probably more of a hack to make things parallelizable

But that's the entire point of it. Transformer-based LLM are “more intelligent” just because you can make them bigger and train them on bigger datasets because of this parallelization.

quantadev1y ago

1 more reply

j / k navigate · click thread line to collapse