You're mixing up cause and effect. The transformer architecture was
invented for machine translation – and it's pretty good at it! (Very far from human-level, but still mostly comprehensible, and a significant improvement over the state of the art at the time of first publication.) But we shouldn't treat this as anything more than "special-purpose ML architecture achieves decent results".
The GPT architecture, using transformers to do iterated predictive text, is a modern version of the Markov bot. It's truly awful at translation, when "prompted" to do so. (Perhaps surprisingly so, until you step back, look at the training data, and look at the information flow: the conditional probability of the next token isn't mostly coming from the source text.)
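To make the analogy concrete, here's a toy word-level Markov bot: it conditions the next word only on the last n words of its own output, which is the "iterated predictive text" loop in miniature (GPT differs in using a learned neural model over a long context rather than a lookup table, but the sampling loop is the same shape). The function names and the tiny corpus are just illustrative choices.

```python
import random
from collections import defaultdict

def train(text, order=2):
    """Build a table mapping each n-gram of words to the next words observed after it."""
    words = text.split()
    table = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        table[key].append(words[i + order])
    return table

def generate(table, length=20, seed=0):
    """Iterated prediction: each next word depends only on the preceding n words."""
    rng = random.Random(seed)
    state = rng.choice(list(table))
    out = list(state)
    for _ in range(length):
        candidates = table.get(tuple(out[-len(state):]))
        if not candidates:
            break  # dead end: this n-gram never continued in the training data
        out.append(rng.choice(candidates))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran off the mat"
print(generate(train(corpus)))
```

Note where the conditional probability comes from: entirely from recent output context. That's why, absent an architecture that routes information from a source sentence, "translation" is not something this style of model is doing natively.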
I haven't read that paper yet, but it looks interesting. From the abstract, it seems to be one of those perfectly valid papers that laypeople take to be making a stronger claim than it is. This paragraph supports that:
> Note that these models are not intended to accurately capture natural language. Rather, they illustrate how our theory can be used to study the effect of language similarity and complexity on data requirements for UMT.