Retentive Network: A Successor to Transformer for Large Language Models (opens in new tab)

(arxiv.org)

112 pointssangel2y ago19 comments

19 comments

ttul2y ago

The key differences between multi-head attention in Transformers and the proposed multi-scale retention (MSR) in RetNets are:

- Retention replaces the softmax in attention with an exponential decay along the sequence dimension. This allows formulating retention in a recurrent form for efficient O(1) inference.

- Retention heads use different decay rates (gamma values) for multi-scale modeling. Attention heads use the same softmax.

- Retention outputs are normalized per-head with GroupNorm before concatenation. Attention uses LayerNorm on the concatenated output.

- Retention can be computed in parallel, recurrent, or chunkwise recurrent modes. Attention is only parallel.

- The recurrent form enables RetNets to summarize long previous context into a fixed-size state during inference. Attention recomputes on the full context each step.

- So in summary, retention adapts attention to enable recurrent modeling and multi-scale decays. This provides efficiency benefits and competitive performance to Transformers.

sp3322y ago

With transformer models, it’s common to put instructions and system messages at the beginning of the input. But with this decay, the beginning of the input would always have the sparsest attention, right? Maybe the instructions should be moved to the end. But then again if it’s recurrent, you might want to prime it with a description of the task.

ttul2y ago

Hypothetically, fine tuned transformer networks would learn to attend more to the beginning and end of the sequence since that is where the instructions typically are. And lo and behold a recent paper demonstrated this to be true.

sfriedr2y ago

I only spent a few minutes skimming thr paper, but:

1) there are a lot of papers claiming to be the successor to the Transformer, and not all of them are cited; e.g., the MetaFormer is missing https://arxiv.org/abs/2111.11418. Another candidate that wasn't compares against (or at least argued why it wouldn't make sense to compare against) are the Hopfield Networks https://arxiv.org/abs/2008.02217.

So until a more solid Related Work section is written (their section is actually called "Relation to and Differences from Previous Methods") I reserve the right to be skeptical whether their model is the "best" successor to the Transformer.

2) they say in the abstract "We theoretically derive the connection between recurrence and attention" but I couldn't find a longer theorem-proof section. So either this is done only in a cursory manner, or the proof is very easy. Recurrence and attention have been around for a long time as concepts, so surely there are already proofs in similar contexts of this fact (I am not working in this particular area of Machine Learning, so I don't know the SOTA by heart, but I strongly suspect that these aspects have been discusses previously; thr Hopfield Network paper I linked to unearthes some theoretical facts about attention, for example).

So -based on my very cursory reading- this paper seems like an interesting approach, but I do see some holes in thr execution. Time will tell whether Rentetive Network will become mainstream or not.

Ok, this was my five minute review of the paper. Now I have to urgently return to completing my actual reviews for NeurIPS, haha.

whimsicalism2y ago

> 1) there are a lot of papers claiming to be the successor to the Transformer, and not all of them are cited; e.g., the MetaFormer is missing https://arxiv.org/abs/2111.11418. Another candidate that wasn't compares against (or at least argued why it wouldn't make sense to compare against) are the Hopfield Networks https://arxiv.org/abs/2008.02217.

Neither of those papers are NLP applicable? And I think it's perfectly fair to focus on the alternatives (ie. like H3 and RWKV) that have been able to scale up to LLM levels and perplexity, which neither of the alternatives you mention have. Should they just cite every 'is All You Need' paper?

sfriedr2y ago

Not, but having a small section in the paper would be reasonable, that illustrates why the most pertinent models that might be relevant at first sight (like the ones I cited) are actually not applicable.

The onus is on the authors to place their research in context and provide compelling arguments - not on the reader to guess why their model was compared against model A, but not model B.

What do I mean by "pertinent"? Of course it is not necessary to cite every "All You Need" paper.

But:

(A) I'd argue it would be necessary to cite those "All You Need" papers that have either gathered a fair amount of citations or media attention (which is the case for both of the papers I linked to), or are meaningful descendants (in the "has been cited by" tree) of thise papers. As I said, this is not really my field - but I would say there is some change that among the hundreds of papers that have cited the papers I linked, some have been scaled up to LLM levels and use basically the same MetaFormer/Hopfield architecture.

(B) If the above isn't the case and none of those models have been scaled to LLM levels - that's fine too. But then please tell the readers that you did due diligence and found that there actually is this gap in the literature (of course, feel free to close it yourself and then be the first one to train one of those models to such scales; that's the reward for doing a solid literature review - and who knows, maybe you stumble upon an even better model that will get you many citations).

(C) If you cannot perform a comprehensive literature search, but the models you compare against cover 90% of the models that out there (in production or research), and you can back that claim up - then, of course, you're safe too and I'd be very happy to be able concongratulate you that you really dis manage to achieve a breakthrough.

(D) Even if none of these things applt and it just costs too much in terms of computational power to train many other potentially competing models, or is too cumbersome to carry out a a comprehensive literature reading- that's also fine. You can then simply constrain your paper more and consider a more precisely defines slice of models, for which you then can actually do a through literature review and comparison. Then you'd also need to adapt your title though, so that it reflects the more precise scope. And please don't take this negatively, as a reader I'd much rather have a model that is proven with the highest scientific standard to be state-of-the-art on a more narrow scope, rather than a more broad claim, with only a moderate amount of evidence backing it up.

So, PLEASE, don't leave us guessing!

If you have a new candidate model that you claim is the "successor" -a strong word- but compare it to just 6 other models, and importantly, you don't let the reader know which of the options (A) to (C) apply, then you have to go with (D).

Machine learning is already too full with papers whose titles are overly broad. Somehow, other scientific disciplines have a much more sober title formulations, yet ML insists on colorful titles that usually are not particularly informative (and yes, "XYZ is All You Need" I consider to be an example of such a bad title).

1 more reply

whimsicalism2y ago

Amazing - 6.7 billion is significantly larger than I've seen any transformer alternative trained to so far (H3 only went up to 2.7b, e: oops - RWKV goes up to 14b)... cool to see that it appears to be scaling even better and the O(1) & O(N) scaling is great.

Wish there was more consistency on source of training data. Training on just The Pile would enable more clean comparison with most promising transformer alternatives, like H3 and give a better sense of how robust the perplexity improvements cited are.

bhy2y ago

RWKV has 14B version.

whimsicalism2y ago

good point, for some reason i always leave out rwkv when thinking of the transformer models.. perhaps because it is more of a redux

lumost2y ago

This a super interesting paper, but in my oppinion - they do not complete their core claim of offering a generalized mathematical structure for attention/recurrence. The specific structure they propose is very interesting and demonstrably efficient computationally - however they do not show that this approach produces similar accuracy as large LLMs.

I’m anxiously awaiting the follow up where someone tries spending 1MM+ on demonstrating this approaches effectiveness in a large language model context.

whimsicalism2y ago

> however they do not show that this approach produces similar accuracy as large LLMs.

I think they have demonstrated their case pretty well, unless there is some serious degradation of the scaling - 7b is pretty big.

turingfeel2y ago

Interestingly, I did see this tweet [0] mentioning a phase shift that occurs in transformers at exactly the scale RetNet stopped at. Probably simply coincidental but I was previously unaware of this phenomenon at such a scale in transformers.

[0] https://twitter.com/gordic_aleksa/status/1682479676910870529

1 more reply

wills_forward2y ago

MIT and Microsoft Researchers Introduce "RetNet" - An 8X Faster Transformer Alternative for AI

mirekrusin2y ago

The claim is parallelism for training which is not fixed speed up, different complexity for inference (constant time), and different complexity for large context inference (linear) - so nothing that can be summarised as 8x - or am I getting this summary wrong?

whimsicalism2y ago

the words per second i believe from the first graph in the paper

canjobear2y ago

What’s the MIT connection? The authors all seem to be affiliated with Tsinghua.

j / k navigate · click thread line to collapse

19 comments

ttul2y ago

The key differences between multi-head attention in Transformers and the proposed multi-scale retention (MSR) in RetNets are:

- Retention replaces the softmax in attention with an exponential decay along the sequence dimension. This allows formulating retention in a recurrent form for efficient O(1) inference.

- Retention heads use different decay rates (gamma values) for multi-scale modeling. Attention heads use the same softmax.

- Retention outputs are normalized per-head with GroupNorm before concatenation. Attention uses LayerNorm on the concatenated output.

- Retention can be computed in parallel, recurrent, or chunkwise recurrent modes. Attention is only parallel.

- The recurrent form enables RetNets to summarize long previous context into a fixed-size state during inference. Attention recomputes on the full context each step.

- So in summary, retention adapts attention to enable recurrent modeling and multi-scale decays. This provides efficiency benefits and competitive performance to Transformers.

sp3322y ago

ttul2y ago

sfriedr2y ago

I only spent a few minutes skimming thr paper, but:

So -based on my very cursory reading- this paper seems like an interesting approach, but I do see some holes in thr execution. Time will tell whether Rentetive Network will become mainstream or not.

Ok, this was my five minute review of the paper. Now I have to urgently return to completing my actual reviews for NeurIPS, haha.

whimsicalism2y ago

sfriedr2y ago

The onus is on the authors to place their research in context and provide compelling arguments - not on the reader to guess why their model was compared against model A, but not model B.

What do I mean by "pertinent"? Of course it is not necessary to cite every "All You Need" paper.

But:

So, PLEASE, don't leave us guessing!

1 more reply

whimsicalism2y ago

bhy2y ago

RWKV has 14B version.

whimsicalism2y ago

good point, for some reason i always leave out rwkv when thinking of the transformer models.. perhaps because it is more of a redux

lumost2y ago

I’m anxiously awaiting the follow up where someone tries spending 1MM+ on demonstrating this approaches effectiveness in a large language model context.

whimsicalism2y ago

> however they do not show that this approach produces similar accuracy as large LLMs.

I think they have demonstrated their case pretty well, unless there is some serious degradation of the scaling - 7b is pretty big.

turingfeel2y ago

[0] https://twitter.com/gordic_aleksa/status/1682479676910870529

1 more reply

wills_forward2y ago

MIT and Microsoft Researchers Introduce "RetNet" - An 8X Faster Transformer Alternative for AI

mirekrusin2y ago

whimsicalism2y ago

the words per second i believe from the first graph in the paper

canjobear2y ago

What’s the MIT connection? The authors all seem to be affiliated with Tsinghua.

j / k navigate · click thread line to collapse