tedsanders 20d ago

Yeah, long context vs compaction is always an interesting tradeoff. More information isn't always better for LLMs, as each token adds distraction, cost, and latency. There's no single optimum for all use cases.

For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.
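A minimal sketch of what such an override might contain, assuming the two keys are set as plain `key = value` lines in the Codex configuration file. The helper `render_overrides`, the 1,000,000 window, and the 90% compaction point are illustrative assumptions, not recommended values.

```python
def render_overrides(context_window: int, compact_percent: int) -> str:
    """Render the two override keys named above as `key = value` lines.

    compact_percent is a hypothetical knob: the share of the window
    (1-100) that may fill up before auto-compaction starts.
    """
    if not 1 <= compact_percent <= 100:
        raise ValueError("compact_percent must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    compact_limit = context_window * compact_percent // 100
    return (
        f"model_context_window = {context_window}\n"
        f"model_auto_compact_token_limit = {compact_limit}\n"
    )


if __name__ == "__main__":
    print(render_overrides(1_000_000, 90), end="")
```

Keeping the compaction limit below the window leaves headroom for the turn that triggers compaction.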

Curious to hear if people have use cases where they find 1M works much better!

(I work at OpenAI.)

akiselev 20d ago

> Curious to hear if people have use cases where they find 1M works much better!

Reverse engineering [1]. When decompiling a bunch of code and tracing functionality, it's really easy to fill up the context window with irrelevant noise, and compaction generally causes the model to lose the plot entirely and have to start almost from scratch.

(Side note, are there any OpenAI programs to get free tokens/Max to test this kind of stuff?)

[1] https://github.com/akiselev/ghidra-cli

fragmede 20d ago

OpenAI has a program for trusted cybersecurity researchers: https://openai.com/index/trusted-access-for-cyber/

simianwords 20d ago

Do you maybe want to give us users some hints on what to compact and throw away? In Codex CLI, maybe you could create a visual tool where I can see the context and quickly check off things I want to discard.

Sometimes I’m exploring a topic, and only the summary of that exploration is useful, not the exploration itself.

Also, the CLI could make a best guess, tell me what it wants to compact, and let me tweak its suggestion in natural language.

Context is going to be super important because it is the primary constraint. It would be nice to have serious granular support.

sillysaurusx 20d ago

You may want to look over this thread from cperciva: https://x.com/cperciva/status/2029645027358495156

I too tried Codex and found it similarly hard to control over long contexts. It ended up coding an app that spit out millions of tiny files which were technically smaller than the original files it was supposed to optimize, except due to there being millions of them, actual hard drive usage was 18x larger. It seemed to work well until a certain point, and I suspect that point was context window overflow / compaction. Happy to provide you with the full session if it helps.
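A rough sketch of why many tiny files can use more disk than fewer large ones, assuming the filesystem allocates space in whole 4096-byte blocks. The block size and file sizes are illustrative assumptions. The sketch ignores inode overhead and filesystems that store small files inline, and it does not reproduce the 18x figure from the session above.

```python
def allocated_bytes(file_size: int, block_size: int = 4096) -> int:
    """Bytes a file occupies when space is handed out in whole blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def usage_ratio(file_sizes: list[int], block_size: int = 4096) -> float:
    """Ratio of on-disk usage to the logical sum of file sizes."""
    logical = sum(file_sizes)
    if logical == 0:
        raise ValueError("need at least one non-empty file")
    on_disk = sum(allocated_bytes(size, block_size) for size in file_sizes)
    return on_disk / logical


if __name__ == "__main__":
    # One million hypothetical 200-byte files.
    print(usage_ratio([200] * 1_000_000))
```

Under these assumptions each 200-byte file still occupies a full block, so the total on-disk footprint is a large multiple of the logical size.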

I’ll give Codex another shot with 1M. It just seemed like cperciva’s case and my own might be similar in that once the context window overflows (or refuses to fill) Codex seems to lose something essential, whereas Claude keeps it. What that thing is, I have no idea, but I’m hoping longer context will preserve it.

FrankBooth 20d ago

What’s the connection with context size in that thread? It seems more like an instruction following problem.

cperciva 20d ago

Yeah, I would definitely characterize it as an instruction following problem. After a few more round trips I got it to admit that "my earlier passes leaned heavily on build/tests + targeted reads, which can miss many “deep” bugs that only show up under specific conditions or with careful semantic review" and then asking it to "Please do a careful semantic review of files, one by one." started it on actually reviewing code.

Mind you, the bugs it reported were mostly bogus. But at least I was eventually able to convince it to try.

sillysaurusx 20d ago

It occurred to me that searching 196 .c files was a context window issue, but maybe there’s something else going on. Either way, Codex could behave better.

woadwarrior01 20d ago

Please don't post links with tracking parameters (t=jQb...).

https://xcancel.com/cperciva/status/2029645027358495156

sillysaurusx 20d ago

Haha. This was the second time in like a year that I’ve posted a Twitter link, and the second time someone complained. Okay, I’ll try to remove those before posting, and I’ll edit this one out.

Feels like a losing battle, but hey, the audience is usually right.

lubesGordi 20d ago

It's funny that context window size is still such a thing. The whole LLM 'thing' is compression. Why can't we figure out some equally brilliant way of handling context besides just storing text somewhere and feeding it to the LLM? RAG is the best attempt so far. We need something like a dynamic, in-flight LLM/data structure generated from the context that the agent can query as it goes.

Kostchei 17d ago

My favorite solution is a lower-parameter, 5-layer model trained on the data that acts as local compression and response, a neocortex layer wrapped around any large persistent data you have to interact with, and... maybe also a specialist tool that spins up, built with that data in mind but deterministic in its approach: sort of a just-in-time index or adaptive indexing.

le-mark 20d ago

That’s actually a pretty cool idea. When I think about my internal mental model of a codebase I’m working on it’s definitely a compacted lossy thing that evolves as I learn more.

nowittyusername 20d ago

Personally, what I am more interested in is the effective context window. I find that when using codex 5.2 high, I preferred to start compaction at around 50% of the context window because I noticed degradation at around that point. Though as of about a month ago that point is now below that, which is great. Anyway, I feel that I will not be using that 1 million context at all in 5.4, but if the effective window is something like 400k context, that by itself is already a huge win. That means longer sessions before compaction, and the agent can keep working on complex stuff for longer. But then there is the issue of the intelligence of 5.4. If it's as good as 5.2 high I am a happy camper; I found 5.3 anything... lacking, personally.
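The "compact at a fraction of the window" habit can be sketched as a simple threshold check. The function names are hypothetical; the 50% default mirrors the preference described above, and the 400k window is the figure floated above rather than a measured value.

```python
def should_compact(used_tokens: int, context_window: int,
                   threshold: float = 0.5) -> bool:
    """True once usage reaches the chosen fraction of the window."""
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    return used_tokens >= context_window * threshold


def tokens_until_compaction(used_tokens: int, context_window: int,
                            threshold: float = 0.5) -> int:
    """Tokens left before the threshold is hit (never negative)."""
    limit = int(context_window * threshold)
    return max(0, limit - used_tokens)


if __name__ == "__main__":
    print(should_compact(150_000, 400_000))           # below the 50% mark
    print(tokens_until_compaction(150_000, 400_000))  # headroom remaining
```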

gck1 20d ago

Not sure how accurate this is, but I found the contextarena benchmarks today when I had the same question.

From these, it appears only Gemini has actual context == effective context. Although, I wasn't able to test this in either Gemini CLI or Antigravity with my Pro subscription because, well, it appears nobody actually uses these tools at Google.

https://contextarena.ai/?showLabels=false

Someone1234 20d ago

That's an interesting point regarding context vs. compaction. If that's viewed as the best strategy, I'd hope we would see more tools around compaction than just "I'll compact what I want, brace yourselves" without warning.

Like, I'd love an optional pre-compaction step, "I need to compact, here is a high level list of my context + size, what should I junk?" Or similar.

thyb23 20d ago

This is exactly how it should work. I imagine it as a tree view showing both full and summarized token counts at each level, so you can immediately see what’s taking up space and what you’d gain by compacting it.

The agent could pre-select what it thinks is worth keeping, but you’d still have full control to override it. Each chunk could have three states: drop it, keep a summarized version, or keep the full history.

That way you stay in control of both the context budget and the level of detail the agent operates with.
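A minimal sketch of the data model behind such a tree view, assuming each chunk carries a full token count, a summarized token count, and one of the three states. All class and field names are hypothetical, and a summary is assumed to replace the chunk's entire subtree.

```python
from dataclasses import dataclass, field
from enum import Enum


class Keep(Enum):
    DROP = "drop"
    SUMMARY = "summary"
    FULL = "full"


@dataclass
class Chunk:
    label: str
    full_tokens: int = 0      # tokens held directly by this chunk
    summary_tokens: int = 0   # tokens if this subtree is summarized
    state: Keep = Keep.FULL
    children: list["Chunk"] = field(default_factory=list)

    def total_full(self) -> int:
        """Token count of the whole subtree with nothing compacted."""
        return self.full_tokens + sum(c.total_full() for c in self.children)

    def kept(self) -> int:
        """Token count of the subtree under the current selections."""
        if self.state is Keep.DROP:
            return 0
        if self.state is Keep.SUMMARY:
            return self.summary_tokens
        return self.full_tokens + sum(c.kept() for c in self.children)


def render(chunk: Chunk, depth: int = 0) -> list[str]:
    """One indented line per chunk: state, label, full and kept counts."""
    line = (f"{'  ' * depth}[{chunk.state.value}] {chunk.label}: "
            f"full={chunk.total_full()} kept={chunk.kept()}")
    lines = [line]
    for child in chunk.children:
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    session = Chunk("session", children=[
        Chunk("exploration", 40_000, 1_500, Keep.SUMMARY),
        Chunk("old build logs", 25_000, 800, Keep.DROP),
        Chunk("current task", 12_000, 2_000),
    ])
    print("\n".join(render(session)))
```

The difference between `total_full()` and `kept()` at any level is what compacting that subtree would save.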

joquarky 20d ago

I compact manually by having it write out to a file; I prune what's no longer relevant and then start a new session with that file.

But I'm mostly working on personal projects so my time is cheap.

I might experiment with having the file sections post-processed through a token counter though, that's a great idea.
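A rough sketch of that post-processing step, assuming the handoff file uses Markdown-style `#` headings to separate sections. The four-characters-per-token estimate is a crude stand-in for a real tokenizer, and both function names are hypothetical.

```python
import re


def approx_tokens(text: str) -> int:
    """Crude estimate: about four characters per token, rounded up."""
    return (len(text) + 3) // 4


def section_token_counts(notes: str) -> list[tuple[str, int]]:
    """Split notes on '#' headings and estimate tokens per section."""
    sections: list[tuple[str, int]] = []
    title, body = "(preamble)", []
    for line in notes.splitlines():
        match = re.match(r"#+\s+(.*)", line)
        if match:
            # Close the previous section, skipping an empty preamble.
            if body or title != "(preamble)":
                sections.append((title, approx_tokens("\n".join(body))))
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    sections.append((title, approx_tokens("\n".join(body))))
    return sections


if __name__ == "__main__":
    sample = "# Goal\nShip the parser fix.\n# Dead ends\nTried regex; too slow.\n"
    for name, count in section_token_counts(sample):
        print(f"{count:>6}  {name}")
```

Sorting the result by count shows which sections are worth pruning first.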

Folcon 20d ago

I do find it really interesting that more coding agents don't have this as a toggleable feature. Sometimes you really need this level of control to get useful capability.

joshvm 20d ago

Have you tried writing that as a skill? Compaction is just a prompt with a convenient UI to keep you in the same tab. There's no reason you can't ask the model to do that yourself and start a new conversation. You can look up Claude's /compact definition, for reference.

However, in some harnesses the model is given access to the old chat log/"memories", so you'd need a way to provide that. You could compromise by running /compact and pasting the output from your own summarizer (that you ran first, obviously).

mindplunge 20d ago

Frontend work with large component libraries. When I'm refactoring shared design system components, things like a token system that touches 80+ files, compaction tends to lose the thread on which downstream components have already been updated vs which still need changes. It ends up re-doing work or missing things silently.

The model holds "what has been updated" well at the start of a session. After compaction, it reconstructs from summaries, and that reconstruction is lossy exactly where precision matters most: tracking partially-complete cross-file operations.

1M context isn't about reading more, it's about not forgetting what you already did halfway through.

jmward01 20d ago

What needs to be an option is letting the agent complete its current work and then compact, going into the 1M version only if needed. That way you can get the most out of the shorter window, but in the case where it just couldn't finish and compact in time, it will (at cost) go over. I wonder how many tokens are actually left at the end of compaction on average. I know there have been many times where I likely needed just another 10-20k, and a better stopping point would have been there.
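One way to picture that policy is a soft limit (the shorter window) and a hard limit (the extended window): past the soft limit, compaction waits for a natural stopping point, and only the hard limit forces it. This is a sketch of the idea, not how any existing harness behaves; the names and limits are hypothetical.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    COMPACT = "compact"


def next_action(used_tokens: int, soft_limit: int, hard_limit: int,
                at_stopping_point: bool) -> Action:
    """Decide whether to keep working or compact now."""
    if soft_limit > hard_limit:
        raise ValueError("soft_limit must not exceed hard_limit")
    if used_tokens >= hard_limit:
        return Action.COMPACT   # no room left, compaction is forced
    if used_tokens >= soft_limit and at_stopping_point:
        return Action.COMPACT   # over budget and the step is finished
    return Action.CONTINUE      # may spill past soft_limit, at extra cost


if __name__ == "__main__":
    # Mid-step and 15k over the soft limit: keep going rather than compact.
    print(next_action(265_000, 250_000, 1_000_000, at_stopping_point=False))
```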

dahcryn 20d ago

I would like to counter your statement that each token adds a distraction.

In our experiments, we see a surprising benefit to rewriting blocks to use more tokens, especially long lists.

E.g. compare these two options:

"The following conditions are excluded from your contract - condition A - condition B ... - condition Z"

The next one works better for us:

"The following conditions are excluded from your contract - condition A is excluded - condition B is excluded ... - condition Z is excluded"

And we now have scripts to rewrite long documents like this, explicitly adding more tokens. Would you have any opinion on this?
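A minimal sketch of what such a rewriting script might look like, assuming the list items sit one per line and start with `- `. The function name and the predicate argument are hypothetical; this is not the commenter's actual script.

```python
def expand_items(text: str, predicate: str) -> str:
    """Append the predicate to every '- ' list item that lacks it."""
    out = []
    for line in text.splitlines():
        stripped = line.rstrip()
        is_item = stripped.lstrip().startswith("- ")
        if is_item and not stripped.endswith(predicate):
            out.append(f"{stripped} {predicate}")
        else:
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    contract = (
        "The following conditions are excluded from your contract\n"
        "- condition A\n"
        "- condition B"
    )
    print(expand_items(contract, "is excluded"))
```

Because items that already end with the predicate are left alone, running the script twice does not double up the wording.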

mnicky 20d ago

This observation makes sense, because probably all current models use some kind of sparse attention architecture.

So the closer the two related pieces of information are to each other in the input context, the larger the chance their relationship will be preserved.

asabla 20d ago

I really don't have any numbers to back this up, but it feels like the sweet spot is around ~500k context size. With anything larger than that, you usually have scoping issues, are trying to do too much at the same time, or have issues with the quality of what's in the context at all.

For me, I would say speed (not just time to first token, but a complete generation) is more important than going for a larger context size.

gspetr 20d ago

I have found a bigger context window quite useful when trying to make sense of larger codebases. Generating documentation on how different components interact is better than nothing, especially if the code has poor test coverage.

I've also had it succeed in attempts to identify some non-trivial bugs that spanned multiple modules.

oidar 20d ago

Context distillation, mostly. Agents tend to report success too early if they find something close to what they need for the task. If you are able to shove it all into a 1M context, it's impossible for them to give up looking: it's in the context. But for actual implementation, it's not useful at all. They get derailed by too long a context.

neom 20d ago

On Claude Code (sorry) the big context window is good for teams. On CC, if you hit compact while a bunch of teams are working, it's a total shit show afterward.
