> [...] we can take to gain orders of magnitude more performance, just like the leap that the Transformers paper had.
Afaik the most important benefit of transformers isn't their “performance” (in the sense of how well they perform their tasks) but their scalability, which comes from the fact that they can be trained and evaluated efficiently on big GPU clusters. That isn't something you can do with recurrent neural networks, because each time step depends on the result of the previous one.
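
To make that concrete, here's a toy numpy sketch (not any real library's code, just my own simplified illustration): the RNN loop has to run step by step because each hidden state needs the previous one, while self-attention over the whole sequence (causal mask omitted for brevity) is a handful of matrix multiplications a GPU can do in parallel.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 4                     # sequence length, hidden size
x = rng.normal(size=(T, d))     # input sequence

# RNN: each hidden state depends on the previous one, so the T steps
# must be computed one after another (hard to parallelize over time).
W, U = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for t in range(T):
    h = np.tanh(x[t] @ W + h @ U)   # step t needs the result of step t-1

# Single-head self-attention: every position is computed from the whole
# sequence with a few big matrix multiplications, which parallelize well.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ V                      # all T outputs at once
```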
And then, if I understood correctly, the benefit of state-space models is that you can train them in parallel but run them in a recurrent fashion, which makes inference cheaper than transformers, especially as the context size grows.
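
Here's a minimal numpy sketch of that dual view, assuming a plain linear SSM (nothing like the actual S4/Mamba code, just the idea): the same recurrence can be evaluated step by step with a fixed-size state (cheap inference, no growing KV cache) or unrolled into a convolution over the whole sequence (parallelizable training), and both give identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 16, 3
A = np.diag(rng.uniform(0.1, 0.9, size=d))   # stable diagonal transition
B = rng.normal(size=d)
C = rng.normal(size=d)
x = rng.normal(size=T)                       # scalar input sequence

# 1) Recurrent view: h_t = A h_{t-1} + B x_t, y_t = C h_t.
#    One fixed-size state, constant work per new token at inference time.
h = np.zeros(d)
y_rec = np.empty(T)
for t in range(T):
    h = A @ h + B * x[t]
    y_rec[t] = C @ h

# 2) Convolution view: unroll the recurrence into a kernel k[j] = C A^j B
#    and convolve it with the input; each output position only depends on
#    the inputs, so the whole sequence can be computed in parallel.
k = np.array([C @ np.linalg.matrix_power(A, j) @ B for j in range(T)])
y_conv = np.array([sum(k[j] * x[t - j] for j in range(t + 1)) for t in range(T)])

assert np.allclose(y_rec, y_conv)   # both views produce the same outputs
```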