Set the boundaries and guidelines before it starts working. Don't leave it space to do things you don't understand.
ie: enforce conventions, set specific and measurable/verifiable goals, define skeletons of the resulting solutions if you want/can.
To give an example. I do a lot of image similarity stuff and I wanted to test the Redis VectorSet stuff when it was still in beta and the PHP extension for redis (the fastest one, which is written in C and is a proper language extension not a runtime lib) didn't support the new commands. I cloned the repo, fired up claude code and pointed it to a local copy of the Redis VectorSet documentation I put in the directory root telling it I wanted it to update the extension to provide support for the new commands I would want/need to handle VectorSets. This was, idk, maybe a year ago. So not even Opus. It nailed it. But I chickened out about pushing that into a production environment, so I then told it to just write me a PHP run time client that mirrors the functionality of Predis (pure-php implementation of redis client) but does so via shell commands executed by php (lmao, I know).
Define the boundaries, give it guard rails, use design patterns and examples (where possible) that can be used as reference.
All the old techniques and concepts still apply.
You say "Do this thing".
- It does the thing (takes 15 min). Looks incredibly fast. I couldn't code that fast. It's inhuman. So far all the fantastical claims hold up.
But still. You ask "Did you do the thing?"
- it says oops I forgot to do that sub-thing. (+5m)
- it fixes the sub-thing (+10m)
You say is the change well integrated with the system?
- It says not really, let me rehash this a bit. (+5m)
- It irons out the wrinkles (+10m)
You say does this follow best engineering practices, is it good code, something we can be proud of?
- It says not really, here are some improvements. (+5m)
- It implements the best practices (+15m)
You say to look carefully at the change set and see if it can spot any potential bugs or issues.
- It says oh, I've introduced a race condition at line 35 in file foo and an null correctness bug at line 180 of file bar. Fixing. (+15m)
You ask if there's test coverage for these latest fixes?
- It says "i forgor" and adds them. (+15m)
Now the change set has shrunk a bit and is superficially looking good. Still, you must read the code line by line, and with an experienced eye will still find weird stuff happening in several of the functions, there's redundant operations, resources aren't always freed up. (60m)
You ask why it's implemented in such a roundabout way and how it intends for the resources to be freed up?
- It says "you're absolutely right" and rewrites the functions. (+15m)
You ask if there's test coverage for these latest fixes?
- It says "i forgor" and adds them. (+15m)
Now the 15 minutes of amazingly fast AI code gen has ballooned into taking most of the afternoon.
Telling Claude to be diligent, not write bugs, or to write high quality code flat out does not work. And even if such prompting can reduce the odds of omissions or lapses, you still always always always have to check the output. It can not find all the bugs and mistakes on its own. If there are bugs in its training data, you can assume there will be bugs in its output.
(You can make it run through much of this Socratic checklist on its own, but this doesn't really save wall clock time, and doesn't remove the need for manual checking.)
Scale/quantity matter.
This industry is not mature enough for 1000x the bad code we have now. It was barely hanging on with 1x bad code.
It's still pretty fast even considering all the coaxing needed, but holy crap will it rapidly deteriorate the quality of a code base if you just let it make changes as it pleases.
It very much feels like how the most vexing enemy of The Flash is like just some random ass banana peel on the road. Raw speed isn't always an asset.