Yeah, LLM capabilities are mostly benchmarked with short, fresh context windows, yet people want to use these models with 50k, 100k, even 500k tokens of context. As you pack in more and more context, the model's abilities really start to deteriorate. The first ~10k tokens are the juiciest; after that it just gets worse and worse.
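You can check this yourself with a quick needle-in-a-haystack probe: bury one fact in filler text, grow the prompt, and watch recall fall off as the context gets longer. Here's a rough Python sketch; `query_model` is just a placeholder for whatever API client you actually use, and word counts are only a loose stand-in for tokens.

```python
# Needle-in-a-haystack probe: hide one fact mid-prompt, pad the prompt
# to different lengths, and see at what length the model loses it.
# NOTE: query_model() is a placeholder -- wire it to your own client.
# Word counts below only roughly approximate token counts.

FILLER = "The sky was clear and the market was quiet that day. "
NEEDLE = "The secret code is 7421. "
QUESTION = "\n\nWhat is the secret code? Answer with the number only."


def build_prompt(total_words: int) -> str:
    """Pad with filler up to ~total_words, burying the needle at ~50% depth."""
    words_per_chunk = len(FILLER.split())
    n_chunks = max(total_words // words_per_chunk, 1)
    chunks = [FILLER] * n_chunks
    chunks.insert(len(chunks) // 2, NEEDLE)  # needle in the middle
    return "".join(chunks) + QUESTION


def query_model(prompt: str) -> str:
    """Placeholder: replace with a call to your model of choice."""
    raise NotImplementedError("plug in your model client here")


if __name__ == "__main__":
    for words in (1_000, 10_000, 50_000, 100_000):
        answer = query_model(build_prompt(words))
        hit = "OK" if "7421" in answer else "MISS"
        print(f"{words:>7} words: {hit} ({answer!r})")
```

In my experience the MISS rate climbs noticeably once you get past the first ten thousand or so tokens, which matches the degradation described above.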