Only if you are doing it wrong; search >>> summarization.
Then there's the other question: is it deterministic between runs, or am I going to get a different summary each session, turn, or tool call? And depending on that frequency, am I using more tokens than I save by doing summarization for N tools?
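As a rough back-of-envelope (a sketch with made-up numbers, not measurements from this tool):

```python
# Hypothetical break-even: summarization only pays off if the summary is
# reused enough times to amortize the tokens spent producing it.
tool_output_tokens = 2_000   # raw tool output size (illustrative)
summary_tokens = 300         # summarized version (illustrative)
summarization_cost = 2_500   # tokens burned generating the summary, input + output (illustrative)

reuses_needed = summarization_cost / (tool_output_tokens - summary_tokens)
print(f"summary must be reused ~{reuses_needed:.1f} times to break even")
# If the summary is regenerated every session, turn, or tool call, you may never get there.
```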
Minimizing token usage is not the goal in and of itself, re: the ageless tradeoff of quantity vs. quality.
For some context, my system prompt is around 5k tokens at the start. I put file contents there (read/write/agents.md), which saves millions of tokens and seems to work better than making them message parts.
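Roughly what I mean, as a sketch (the helper name and the pinned file list are illustrative, not my actual setup):

```python
from pathlib import Path

# Hypothetical sketch: inline key files directly into the system prompt
# instead of re-attaching them as message parts on every turn.
PINNED_FILES = ["AGENTS.md", "README.md"]  # illustrative file list

def build_system_prompt(base_instructions: str, repo_root: str = ".") -> str:
    parts = [base_instructions]
    for name in PINNED_FILES:
        path = Path(repo_root) / name
        if path.is_file():
            parts.append(f"## {name}\n{path.read_text()}")
    return "\n\n".join(parts)
```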
> Just trimmed boilerplate
This is not what I see this tool doing. It's automatically manipulating, in the background, words that you should put far more care and attention towards. Referring to those words as "boilerplate" you can just throw into a slop machine is revealing.