—
v0.5.0 was about figuring out why models weren't using tilth tools consistently, even when the tools were available.
Results vs baseline (built-in tools only):
Sonnet 4.6: -44% $/correct (84% → 94% accuracy, 31% fewer turns)
Opus 4.6: -39% $/correct (91% → 92% accuracy, 37% fewer turns)
Haiku 4.5: -38% $/correct (54% → 73% accuracy, 7% fewer turns)
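
For anyone unsure how to read $/correct, here is a minimal sketch. It assumes the metric is total run cost divided by the number of correctly answered tasks, which is my reading of the name and may not match the benchmark's exact definition. The function names and all numbers are made up for illustration; none of them are benchmark data.

```python
def cost_per_correct(total_cost_usd: float, n_correct: int) -> float:
    """Dollars spent per correctly answered task (assumed definition)."""
    if n_correct <= 0:
        raise ValueError("need at least one correct answer")
    return total_cost_usd / n_correct


def relative_change(baseline: float, candidate: float) -> float:
    """Fractional change from baseline; negative means cheaper."""
    return (candidate - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical run over 50 tasks (illustrative numbers only).
    baseline = cost_per_correct(total_cost_usd=5.00, n_correct=40)   # built-in tools
    candidate = cost_per_correct(total_cost_usd=3.60, n_correct=45)  # with extra tools
    print(f"baseline:  ${baseline:.3f} per correct")
    print(f"candidate: ${candidate:.3f} per correct")
    print(f"change:    {relative_change(baseline, candidate):+.0%}")
```

Under this definition the metric improves both when a run gets cheaper and when it gets more answers right, so a cost drop and an accuracy gain compound.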
—
https://github.com/jahala/tilth/
Full results: https://github.com/jahala/tilth/blob/main/benchmark/README.m...
PS: I don't have the budget to run the benchmark often (especially with Opus), so if any token whales have the capacity to run some benchmarks, please feel free to PR the results.