I feel like there is a step missing here...
People were using RNN encoder/decoder models for machine translation: the encoder compressed the source-language sentence into a single representation (a fixed-size vector), and the decoder generated the target-language sentence from that representation.
The issue people kept bumping into was that this fixed-size vector bottlenecked the encoder/decoder architecture: squeezing a variable-length source sentence into a fixed-size vector loses information, and the loss grows with the length of the source sentence.
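The bottleneck is easy to see in code. Here's a minimal sketch with a toy numpy RNN (the weights are random placeholders, not a trained model): whatever the source length, the decoder only ever receives the final hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4

# Toy untrained RNN encoder; weight names are illustrative, not from any paper.
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.1
W_x = rng.normal(size=(hidden_size, embed_size)) * 0.1

def encode(source_embeddings):
    """Run a simple RNN over the source; return ONLY the last hidden state."""
    h = np.zeros(hidden_size)
    for x in source_embeddings:
        h = np.tanh(W_h @ h + W_x @ x)
    return h  # fixed-size vector, regardless of sentence length

short_sentence = rng.normal(size=(3, embed_size))
long_sentence = rng.normal(size=(50, embed_size))

# Both sentences get squeezed into the same 8 numbers:
assert encode(short_sentence).shape == (hidden_size,)
assert encode(long_sentence).shape == (hidden_size,)
```

A 3-token and a 50-token sentence end up as vectors of identical size, which is exactly where the information loss comes from.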
People started adding attention to the decoder to work around this issue: each decoder step could attend to every token (well, every RNN hidden representation) of the source sentence. This led to the RNN + attention architecture.
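That attention step can be sketched like this (a toy dot-product attention in numpy; the original RNN + attention work used a small learned scoring network, so take this scoring function as an illustrative simplification): at each decoder step, the decoder state scores every encoder hidden state and receives a weighted mix of all of them, instead of just the final one.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Dot-product attention: weight every encoder hidden state
    by its similarity to the current decoder state."""
    scores = encoder_states @ decoder_state      # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over source positions
    context = weights @ encoder_states           # weighted mix of ALL states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))  # one hidden state per source token
decoder_state = rng.normal(size=8)

context, weights = attend(decoder_state, encoder_states)
assert context.shape == (8,)
assert np.isclose(weights.sum(), 1.0)
```

The point is that `context` is recomputed at every decoder step, so no single fixed-size vector has to carry the whole sentence.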
The title 'Attention Is All You Need' comes from the realization that in this architecture the RNN is not needed, for either the encoder or the decoder. It's a message to a field that was using RNNs + attention (to avoid the bottleneck). Of course, the rest was born from that: encoder-only transformer models like BERT, and decoder-only models like current LLMs.