undefined | Better HN

0 pointssoerxpso21h ago0 comments

My understanding of caching with most models/providers is that a prefix substring of the context has to be reused for a cache hit, but not necessarily the whole entire context window. So if you prune tool calls from the history, you're going to get one cache miss on the newly-pruned history, and then you're going to be getting cache hits on every subsequent turn, with a lower number of input tokens. If you prune subsequent tool calls after that, you would still get a cache hit for the already-pruned portion of the context, just not the full context.

0 comments

__natty__20h ago

So it makes sense to first send stable prompt, reasoning and files content, tool calls summary and actual tool calls at the very end?

leemoore17h ago

The way you do this (and the way opencode does it) is you do most of your pruning in more recent history. Last I looked at opencode, they start pruning tool call results after 2 full agentic turns. So you probably dont get quite as good hits on cache for the most recent 1-5% of your turns, but after that everything else caches fine and those tool calls that likely aren't relavent to your session anymore are gone.

j / k navigate · click thread line to collapse

0 comments

__natty__20h ago

So it makes sense to first send stable prompt, reasoning and files content, tool calls summary and actual tool calls at the very end?

leemoore17h ago

j / k navigate · click thread line to collapse