But it's consistently, wildly slower and falls short every time I've tried.
If it falls short every time you've tried, it's likely that one or more of these is true:

A. You're working on some really deep thing that only world-class experts can do, like optimizing graphics engines for AAA games.
B. You're using a language that isn't in the top ~10 most popular in AI models' training sets.
C. You have an opportunity to improve your ability to use the tools effectively.
How many hours have you spent using Claude Code?
Not exactly world-class software.
Using these tools takes quite a bit of effort, but even after doing all those steps to use the tool well, I still got this project done in a few days when it otherwise would have taken me 1-2 months and likely would never have happened at all.
It also matters whether you have a decent PRD or spec. Are you prompting the harness one bit at a time, or did you give it a complete spec and ask it to analyze it and break it down into individual issues with dependencies (e.g. using beads and beads_viewer)?
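For example, rather than feeding it one request at a time, the kickoff prompt might look something like this (hypothetical wording; "spec.md" is a made-up file name):

    Read spec.md in full. Break it down into individual issues with
    dependencies between them, file each issue in the tracker, and
    then work through them in dependency order, running the test
    suite after each one.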
I'm not looking for reasons to criticize your approach or question your experience, but your answers may point to opportunities for you to get more out of these tools.
If you're using Claude Code and you have a friend who has had more success with these tools, consider exporting your transcripts and letting them have a look: https://simonwillison.net/2025/Dec/25/claude-code-transcript...
This is a relatively common skill. One thing I always notice about the video game industry is that it's much more globally distributed than the rest of the software industry.
Being bad at writing software is Japan's whole thing, but they still make optimized video games.
The issues I ran into are primarily “tail-chasing” ones - it gets into some attractor that doesn’t suit the test case and fails to find its way out. I re-benchmark every few months, but so far none of the frontier models have been able to make changes that have solved the issue without bloating the codebase and failing the perf tests.
It’s fine for some boilerplate dedup or spinning up some web API or whatever, but it’s still not suitable for serious work.
It's insulting that criticism is so often met with superficial excuses and insinuations that the user lacks the required skills.
https://mitchellh.com/writing/my-ai-adoption-journey
My experience mirrors that of Mitchell. It absolutely is at the level now where AI can free up time to do the really interesting stuff.
GP said 'falls short every time I’ve tried'. Note the word 'every'.
Claude would be worse than an expert at this, but this is a benchmarkable task, and Claude can run experiments a lot quicker than a human can. The hard part would be ensuring that the results aren't just gaming your benchmark.
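A minimal sketch of that guard in Python (every name here is made up; the point is that correctness is checked against a slow-but-trusted reference on freshly generated inputs, so a change that hard-codes answers to a fixed test set fails instead of looking fast):

    # Sketch of a benchmark harness that is hard to game (assumed names:
    # optimize_me is the function the AI may change, reference_impl is a
    # slow but trusted implementation).
    import random
    import time

    def reference_impl(xs):
        # Trusted but slow: repeated minimum extraction.
        xs, out = list(xs), []
        while xs:
            m = min(xs)
            xs.remove(m)
            out.append(m)
        return out

    def optimize_me(xs):
        # The implementation the AI is allowed to change.
        return sorted(xs)

    def mean_runtime(trials=5, n=2000):
        rng = random.Random()  # fresh inputs every run, no fixed test set
        total = 0.0
        for _ in range(trials):
            data = [rng.randint(0, 10**6) for _ in range(n)]
            start = time.perf_counter()
            got = optimize_me(data)
            total += time.perf_counter() - start
            # Correctness comes first: fast-but-wrong fails immediately.
            assert got == reference_impl(data), "correctness regression"
        return total / trials

    if __name__ == "__main__":
        print(f"mean runtime over fresh inputs: {mean_runtime():.6f}s")

You'd still want to pin perf thresholds to a reference machine, but the oracle check on fresh inputs is what stops the benchmark from being gamed.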
I feel like comparing it to a junior developer is also becoming fairly outdated. Yes, it is worse in some ways, but also VASTLY superior in others.