Various experiments with the recent Window Attention with Attention Sinks (StreamingLLM) approach indicate that it clearly improves the inference fluency of pretrained LLMs on long sequences, while reducing KV-cache VRAM usage from growing linearly with sequence length to constant.
It can be applied to pretrained LLMs with little to no additional effort, and Hugging Face transformers is working on first-party support. Until then, the third-party module from the blog post already works well.
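The mechanism behind the constant VRAM usage is a simple cache-eviction policy: keep the first few tokens (the "attention sinks") plus a sliding window of the most recent tokens, and drop everything in between. A minimal sketch in plain Python — the function name and parameter values here are illustrative, not the blog post's actual API:

```python
def evict_kv_cache(cache, sink_size=4, window_size=1020):
    """Attention-sink eviction policy (illustrative sketch):
    keep the first `sink_size` entries (the attention sinks) plus
    the most recent `window_size` entries, drop the middle."""
    if len(cache) <= sink_size + window_size:
        return cache  # nothing to evict yet
    return cache[:sink_size] + cache[-window_size:]

# Toy cache of token positions standing in for key/value pairs.
cache = list(range(2000))
cache = evict_kv_cache(cache, sink_size=4, window_size=1020)
assert len(cache) == 1024          # cache size stays constant
assert cache[:4] == [0, 1, 2, 3]   # sink tokens are preserved
assert cache[4] == 980             # window covers the last 1020 tokens
```

Because the cache never grows beyond `sink_size + window_size` entries regardless of how many tokens are generated, memory stays constant; keeping the initial sink tokens is what preserves fluency, since pretrained models dump excess attention mass onto them.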