You're mixing up cause and effect. The transformer architecture was
invented for machine translation – and it's pretty good at it! (Very far from human-level, but still mostly comprehensible, and a significant improvement over the state of the art at the time of first publication.) But we shouldn't treat this as anything more than "special-purpose ML architecture achieves decent results".
The GPT architecture, using transformers to do iterated predictive text, is a modern version of the Markov bot. It's truly awful at translation, when "prompted" to do so. (Perhaps surprisingly so, until you step back, look at the training data, and look at the information flow: the conditional probability of the next token isn't mostly coming from the source text.)
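To make the analogy concrete, here's a toy word-level Markov bot: it conditions the next word only on the last n words of its own output, which is the "iterated predictive text" loop in miniature (GPT differs in using a learned neural model over a long context rather than a lookup table, but the sampling loop is the same shape). The function names and the tiny corpus are just illustrative choices.

```python
import random
from collections import defaultdict

def train(text, order=2):
    """Build a table mapping each n-gram of words to the next words observed after it."""
    words = text.split()
    table = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        table[key].append(words[i + order])
    return table

def generate(table, length=20, seed=0):
    """Iterated prediction: each next word depends only on the preceding n words."""
    rng = random.Random(seed)
    state = rng.choice(list(table))
    out = list(state)
    for _ in range(length):
        candidates = table.get(tuple(out[-len(state):]))
        if not candidates:
            break  # dead end: this n-gram never continued in the training data
        out.append(rng.choice(candidates))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran off the mat"
print(generate(train(corpus)))
```

Note where the conditional probability comes from: entirely from recent output context. That's why, absent an architecture that routes information from a source sentence, "translation" is not something this style of model is doing natively.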
I haven't read that paper yet, but it looks interesting. From the abstract, it seems to be one of those perfectly valid papers that laypeople take to be making a stronger claim than it is. This paragraph supports that:
> Note that these models are not intended to accurately capture natural language. Rather, they illustrate how our theory can be used to study the effect of language similarity and complexity on data requirements for UMT.