undefined | Better HN

0 pointssimianwords6mo ago0 comments

context window is not some physical barrier but rather the attention just getting saturated. what did i get wrong here?

0 comments

qsort6mo ago

> what did i get wrong here?

You don't know how an LLM works and you are operating on flawed anthropomorphic metaphors.

Ask a frontier LLM what a context window is, it will tell you.

Palmik6mo ago

It's a fair question, even if it might be coming from a place of misunderstanding.

For example, DeepSeek 3.2, which employs sparse attention [1], is not only faster with long context than normal 3.1, but also seems to be better (perhaps thanks to reducing the noise?).

[1] It uses still quadratic router, but it's small, so it scales well in practice. https://api-docs.deepseek.com/news/news250929

ed6mo ago

Parent is likely thinking of sparse attention which allows a significantly longer context to fit in memory

qsort6mo ago

My comment was harsher than it needed to be and I'm sorry, I think I should have gotten my point across in a better way.

With that out of the way, parent was wondering why compaction is necessary arguing that "context window is not some physical barrier but rather the attention just getting saturated". We're trying to explain that 3+2=2+3 and you people are sitting in the back going "well, actually, not all groups are abelian".

paradite6mo ago

In theory, auto-regressive models should not have limit on context. It should generate the next token with all previous tokens.

In practice, when training a model, people select a context window so that during inference, you know how much GPU memory to allocate for a prompt and reject the prompt if it exceeds the memory limit.

Of course there's also degrading performance as context gets longer, but I suspect memory limit is the primary factor of why we have context window limits.

kenjackson6mo ago

I think attention literally doesn't see anything beyond the context window. Even within the context window you may start to see attentional issues, but that's a different problem.

j / k navigate · click thread line to collapse

0 comments

qsort6mo ago

> what did i get wrong here?

You don't know how an LLM works and you are operating on flawed anthropomorphic metaphors.

Ask a frontier LLM what a context window is, it will tell you.

Palmik6mo ago

It's a fair question, even if it might be coming from a place of misunderstanding.

For example, DeepSeek 3.2, which employs sparse attention [1], is not only faster with long context than normal 3.1, but also seems to be better (perhaps thanks to reducing the noise?).

[1] It uses still quadratic router, but it's small, so it scales well in practice. https://api-docs.deepseek.com/news/news250929

ed6mo ago

Parent is likely thinking of sparse attention which allows a significantly longer context to fit in memory

qsort6mo ago

My comment was harsher than it needed to be and I'm sorry, I think I should have gotten my point across in a better way.

paradite6mo ago

In theory, auto-regressive models should not have limit on context. It should generate the next token with all previous tokens.

Of course there's also degrading performance as context gets longer, but I suspect memory limit is the primary factor of why we have context window limits.

kenjackson6mo ago

I think attention literally doesn't see anything beyond the context window. Even within the context window you may start to see attentional issues, but that's a different problem.

j / k navigate · click thread line to collapse