As lots of companies, large and small, have shown, test suites can only find what you test for. So what can vibe-coded test suites find?
It does a reasonable job. It's also pretty good at writing regression tests when it fixes a bug.
Where LLMs struggle - or at least where Claude struggles - is fixing the actual bugs. It's very good at getting the test suite to pass. But it cheats. It'll sometimes disable a test, or do some hacky workaround that makes the test pass without fixing the underlying issue. It'll say "All done, the tests pass". But sometimes you really wish they didn't.
I'm wondering if it might be better to set up two agents adversarially for bug hunting. Give one agent the goal of finding as many bugs as possible (via tests and other techniques), and give the other agent the goal of fixing them.
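To make the two-agent idea concrete, here is a rough sketch of what the control loop might look like. Everything in it is hypothetical: `Hunter`, `Fixer` and `adversarial_loop` are made-up names, and in practice each callable would wrap an LLM call rather than a plain function. The one design choice it illustrates is that the hunter, not the fixer, decides when the work is done.

```python
from typing import Callable

# Hypothetical agent interfaces; in practice each would wrap an LLM call.
Hunter = Callable[[str], list[str]]   # code -> descriptions of bugs it found
Fixer = Callable[[str, str], str]     # (code, bug description) -> patched code


def adversarial_loop(
    code: str, hunter: Hunter, fixer: Fixer, max_rounds: int = 5
) -> tuple[str, list[str]]:
    """Alternate hunting and fixing until the hunter finds nothing.

    Returns the final code and whatever bugs the hunter still reports.
    """
    for _ in range(max_rounds):
        bugs = hunter(code)
        if not bugs:
            return code, []
        for bug in bugs:
            code = fixer(code, bug)
    # Out of rounds: let the hunter, not the fixer, have the last word.
    return code, hunter(code)
```

Because the fixer never gets to declare success, "All done, the tests pass" carries no weight on its own; the run only counts as clean when the hunter comes back empty-handed.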
I’ve tried all sorts of things to keep Claude from cheating, but the only one that works is to restrict access to the test files, which obviously isn’t a real solution.
We recently had an “AI week” at work and I spent $1000 in tokens trying out different iterations of this.