I feel like there is a step missing here...
People were using RNN encoder/decoder models for machine translation: the encoder compressed the source-language sentence into a single representation (a fixed-size vector), and the decoder generated the target-language sentence from that representation.
The issue people kept bumping into was that this fixed-size vector bottlenecked the encoder/decoder architecture: squeezing a variable-length source sentence into a fixed-size vector loses information, and the loss grows with the length of the source sentence.
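The bottleneck is easy to see in code. Here's a minimal sketch with a toy numpy RNN (the weights are random placeholders, not a trained model): whatever the source length, the decoder only ever receives the final hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4

# Toy untrained RNN encoder; weight names are illustrative, not from any paper.
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.1
W_x = rng.normal(size=(hidden_size, embed_size)) * 0.1

def encode(source_embeddings):
    """Run a simple RNN over the source; return ONLY the last hidden state."""
    h = np.zeros(hidden_size)
    for x in source_embeddings:
        h = np.tanh(W_h @ h + W_x @ x)
    return h  # fixed-size vector, regardless of sentence length

short_sentence = rng.normal(size=(3, embed_size))
long_sentence = rng.normal(size=(50, embed_size))

# Both sentences get squeezed into the same 8 numbers:
assert encode(short_sentence).shape == (hidden_size,)
assert encode(long_sentence).shape == (hidden_size,)
```

A 3-token and a 50-token sentence end up as vectors of identical size, which is exactly where the information loss comes from.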
People started adding attention to the decoder to work around this issue: each decoder step could attend to every token (well, every RNN hidden representation) of the source sentence. This led to the RNN + attention architecture.
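That attention step can be sketched like this (a toy dot-product attention in numpy; the original RNN + attention work used a small learned scoring network, so take this scoring function as an illustrative simplification): at each decoder step, the decoder state scores every encoder hidden state and receives a weighted mix of all of them, instead of just the final one.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Dot-product attention: weight every encoder hidden state
    by its similarity to the current decoder state."""
    scores = encoder_states @ decoder_state      # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over source positions
    context = weights @ encoder_states           # weighted mix of ALL states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))  # one hidden state per source token
decoder_state = rng.normal(size=8)

context, weights = attend(decoder_state, encoder_states)
assert context.shape == (8,)
assert np.isclose(weights.sum(), 1.0)
```

The point is that `context` is recomputed at every decoder step, so no single fixed-size vector has to carry the whole sentence.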
The title 'Attention Is All You Need' comes from the realization that in this architecture the RNN is not needed, for either the encoder or the decoder. It's a message to a field that was using RNNs + attention (to avoid the bottleneck). Of course, the rest was born from that: encoder-only transformer models like BERT, and decoder-only models like current LLMs.