
0 points · consumer451 · 0y ago · 0 comments

As the context grows, all LLMs appear to turn into idiots, even just at 32k!

> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.
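
To make the "below 50% of baseline" criterion concrete, here is a minimal sketch of the arithmetic, assuming the threshold is relative (score at long context divided by the short-context baseline). The helper names `retention` and `below_half` are made up for illustration and are not from the paper; only the GPT-4o figures come from the quote above.

```python
def retention(baseline: float, score: float) -> float:
    """Fraction of the short-context baseline retained at a longer context."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return score / baseline


def below_half(baseline: float, score: float, threshold: float = 0.5) -> bool:
    """True if the score fell below `threshold` times the baseline (assumed relative reading)."""
    return retention(baseline, score) < threshold


# GPT-4o figures from the quoted abstract: 99.3% baseline, 69.7% at 32K.
gpt4o = retention(99.3, 69.7)
print(f"retained: {gpt4o:.1%}")                         # retained: 70.2%
print(f"absolute drop: {99.3 - 69.7:.1f} points")       # absolute drop: 29.6 points
print(f"below half of baseline: {below_half(99.3, 69.7)}")  # below half of baseline: False
```

On that reading, GPT-4o keeps roughly 70% of its baseline at 32K, which is why it counts as one of the exceptions rather than one of the 10 models under the 50% line.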

https://news.ycombinator.com/item?id=44107536


rs186 · 0y ago

This paper is slightly outdated by LLM standards: GPT-4.1 and Gemini 2.5 hadn't been released at that time.

consumer451 (OP) · 0y ago

Yes, I mentioned that in my comment on the linked post. I wish someone were running this methodology as an ongoing project for new models.

Ideally, isn't this a metric that should be included on all model cards? It seems like a crucial metric.
