For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone: based on our testing, we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.
Curious to hear if people have use cases where they find 1M works much better!
(I work at OpenAI.)
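For anyone wanting to try the override described above, here is a minimal sketch that renders the two settings as `key = value` lines. Only the two key names come from the comment. The helper name `codex_context_overrides`, the TOML-style output format, and the choice to compact at 90% of the window are assumptions for illustration, so check the Codex documentation for where the settings belong and which values are accepted.

```python
def codex_context_overrides(context_window: int, compact_percent: int = 90) -> str:
    """Render the two override lines as TOML-style key = value pairs.

    compact_percent is a hypothetical knob: the share of the window at which
    auto-compaction should kick in. It is not a Codex setting.
    """
    if not 0 < compact_percent <= 100:
        raise ValueError("compact_percent must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    compact_limit = context_window * compact_percent // 100
    return (
        f"model_context_window = {context_window}\n"
        f"model_auto_compact_token_limit = {compact_limit}\n"
    )


if __name__ == "__main__":
    print(codex_context_overrides(1_000_000), end="")
```

Keeping the compaction limit below the window leaves headroom for the next turn, which is the only design choice made here.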
Reverse engineering [1]. When decompiling a bunch of code and tracing functionality, it's really easy to fill up the context window with irrelevant noise, and compaction generally causes the model to lose the plot entirely and have to start almost from scratch.
(Side note, are there any OpenAI programs to get free tokens/Max to test this kind of stuff?)
Sometimes I’m exploring a topic and the exploration itself isn't worth keeping, only its summary.
Also, the CLI could make a best guess, tell me what it wants to compact, and let me tweak its suggestion in natural language.
Context is going to be super important because it is the primary constraint. It would be nice to have serious granular support.
I too tried Codex and found it similarly hard to control over long contexts. It ended up coding an app that spit out millions of tiny files which were technically smaller than the original files it was supposed to optimize, except due to there being millions of them, actual hard drive usage was 18x larger. It seemed to work well until a certain point, and I suspect that point was context window overflow / compaction. Happy to provide you with the full session if it helps.
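One plausible mechanism for the blowup described above is block-granular allocation: a filesystem reserves whole blocks, so a tiny file still occupies at least one block. The sketch below is only arithmetic under stated assumptions. The 4096-byte block size is an assumed, common default, and the model ignores metadata overhead and filesystems that store very small files inline. Nothing in it is taken from the actual session.

```python
def allocated_bytes(file_size: int, block_size: int = 4096) -> int:
    """Bytes reserved on disk for one file, rounded up to whole blocks."""
    if file_size <= 0:
        return 0
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def blowup(file_sizes: list[int], block_size: int = 4096) -> float:
    """Ratio of on-disk usage to the sum of the logical file sizes."""
    logical = sum(file_sizes)
    on_disk = sum(allocated_bytes(size, block_size) for size in file_sizes)
    return on_disk / logical


if __name__ == "__main__":
    # Hypothetical workload: many files of 1 KiB each.
    print(blowup([1024] * 1000))
```

Under these assumptions, the smaller the individual files, the larger the ratio, even though each file is "technically smaller" than the original.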
I’ll give Codex another shot with 1M. It just seemed like cperciva’s case and my own might be similar in that once the context window overflows (or refuses to fill) Codex seems to lose something essential, whereas Claude keeps it. What that thing is, I have no idea, but I’m hoping longer context will preserve it.
Mind you, the bugs it reported were mostly bogus. But at least I was eventually able to convince it to try.
Feels like a losing battle, but hey, the audience is usually right.
Of these, it appears only Gemini has actual context == effective context. That said, I wasn't able to test this in either Gemini CLI or Antigravity with my Pro subscription because, well, it appears nobody actually uses these tools at Google.
Like, I'd love an optional pre-compaction step, "I need to compact, here is a high level list of my context + size, what should I junk?" Or similar.
The agent could pre-select what it thinks is worth keeping, but you’d still have full control to override it. Each chunk could have three states: drop it, keep a summarized version, or keep the full history.
That way you stay in control of both the context budget and the level of detail the agent operates with.
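The three-state idea above could be modeled roughly as follows. This is a sketch of the data structure only, not how any existing harness works: the names `Action`, `Chunk`, `apply_overrides`, and `budget_after`, and the per-chunk `summary_tokens` estimate, are all hypothetical.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Action(Enum):
    DROP = "drop"            # remove the chunk entirely
    SUMMARIZE = "summarize"  # keep a summarized version
    KEEP = "keep"            # keep the full history


@dataclass(frozen=True)
class Chunk:
    label: str
    tokens: int          # size of the full chunk
    summary_tokens: int  # estimated size if summarized
    action: Action       # the agent's pre-selected suggestion


def cost(chunk: Chunk) -> int:
    """Tokens this chunk would occupy after compaction."""
    if chunk.action is Action.DROP:
        return 0
    if chunk.action is Action.SUMMARIZE:
        return chunk.summary_tokens
    return chunk.tokens


def apply_overrides(chunks: list[Chunk], overrides: dict[str, Action]) -> list[Chunk]:
    """Return a new plan where the user's choices replace the agent's."""
    return [replace(c, action=overrides.get(c.label, c.action)) for c in chunks]


def budget_after(chunks: list[Chunk]) -> int:
    return sum(cost(c) for c in chunks)


if __name__ == "__main__":
    plan = [
        Chunk("decompiler output", 40_000, 2_000, Action.DROP),
        Chunk("design discussion", 12_000, 1_500, Action.SUMMARIZE),
        Chunk("current diff", 8_000, 800, Action.KEEP),
    ]
    print(budget_after(plan))
    revised = apply_overrides(plan, {"design discussion": Action.KEEP})
    print(budget_after(revised))
```

Showing the resulting budget before and after each override is what would let the user trade detail against context size explicitly.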
But I'm mostly working on personal projects so my time is cheap.
I might experiment with having the file sections post-processed through a token counter though, that's a great idea.
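A rough version of that post-processing step might look like this. The default counter is a crude assumption (about four characters per token), not a real tokenizer, so it is passed in as a parameter and can be swapped for whatever counter matches your model. The function names and the idea of flagging sections over a budget are illustrative.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Crude estimate: assume roughly 4 characters per token, rounded up."""
    return -(-len(text) // 4)


def oversized_sections(
    sections: dict[str, str],
    budget: int,
    count: Callable[[str], int] = estimate_tokens,
) -> list[tuple[str, int]]:
    """Return (name, tokens) for sections above budget, largest first."""
    sized = [(name, count(body)) for name, body in sections.items()]
    over = [(name, tokens) for name, tokens in sized if tokens > budget]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sections = {
        "header": "x" * 40,
        "parser": "x" * 400,
        "generated tables": "x" * 4000,
    }
    for name, tokens in oversized_sections(sections, budget=50):
        print(f"{name}: ~{tokens} tokens")
```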
However, in some harnesses the model is given access to the old chat log/"memories", so you'd need a way to provide that. You could compromise by running /compact and pasting the output from your own summarizer (that you ran first, obviously).
The model holds "what has been updated" well at the start of a session. After compaction, it reconstructs from summaries, and that reconstruction is lossy exactly where precision matters most: tracking partially-complete cross-file operations.
1M context isn't about reading more, it's about not forgetting what you already did halfway through.
In our experiments, we see a surprising benefit from rewriting blocks to use more tokens, especially long lists.
E.g. compare these two options:
"The following conditions are excluded from your contract - condition A - condition B ... - condition Z"
The next one works better for us:
"The following conditions are excluded from your contract - condition A is excluded - condition B is excluded ... - condition Z is excluded"
And we now have scripts to rewrite long documents like this, explicitly adding more tokens. Would you have any opinion on this?
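The commenter's scripts aren't shown, so the following is only a guess at the shape of such a rewrite: it repeats the predicate on every list item so that each line is self-contained. The function name `expand_list` and its arguments are hypothetical.

```python
def expand_list(header: str, items: list[str], predicate: str) -> str:
    """Repeat the predicate on each item so every line stands on its own."""
    lines = [header]
    for item in items:
        item = item.strip()
        # Leave items alone if they already carry the predicate.
        if not item.endswith(predicate):
            item = f"{item} {predicate}"
        lines.append(f"- {item}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        expand_list(
            "The following conditions are excluded from your contract",
            ["condition A", "condition B"],
            "is excluded",
        )
    )
```

The check for an existing predicate makes the rewrite safe to run twice on the same document.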
So the closer the two related pieces of information are to each other in the input context, the larger the chance their relationship will be preserved.
For me, I would say speed (not just time to first token, but a complete generation) is more important than going for a larger context size.
I've also had it succeed in attempts to identify some non-trivial bugs that spanned multiple modules.