The LLM has one job: to produce code that looks plausible. That's it. No logic has gone into writing that bit of code, so the bugs often won't be like the ones a programmer makes. Instead, LLMs can introduce a whole new class of bug that's way harder to debug.
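To illustrate (a made-up snippet, not the article's example): code like this skims as correct in review, but it's silently wrong at the edges, which is exactly the kind of bug that's hard to catch.

```python
def moving_average(values, window):
    """Return the moving averages of `values` over `window`-sized slices."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

print(moving_average([1, 2, 3, 4], 2))  # [1.5, 2.5, 3.5] -- looks fine
print(moving_average([1, 2], 5))        # [] -- silently wrong: no error, just an empty answer
```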
Funny story: when I first posted that and had a couple of thousand readers, I got many comments of the type "you should just read the code carefully on review", but _nobody_ pointed out that the opening example (the so-called "right code") had the exact same problem described in the article, proving exactly what you just said: it's hard to spot problems caused by plausibility machines.
AI-generated code will fuck up so many lives. The Post Office software in the UK did it without AI. I can't imagine how, and how many, lives will be ruined once some consultancy vibe-codes a government system. I might come to appreciate German bureaucracy and backwardness.
LLMs are way faster than me at writing tests. Just prompt for the kind of test you want.
I can and do use AI to help with test coverage, but coverage is pointless if you don't catch the interesting edge cases.
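For example (parse_price here is a made-up helper, just to show the contrast):

```python
import pytest

def parse_price(text: str) -> int:
    """Parse a price like "$1,234.56" into integer cents (hypothetical helper)."""
    if not text.startswith("$"):
        raise ValueError(f"not a price: {text!r}")
    cents = round(float(text[1:].replace(",", "")) * 100)
    if cents < 0:
        raise ValueError(f"negative price: {text!r}")
    return cents

# Coverage-padding test: exercises every line, catches nothing interesting.
def test_happy_path():
    assert parse_price("$3.50") == 350

# The interesting edge cases are where the bugs hide; prompt for these explicitly.
def test_edge_cases():
    assert parse_price("$0.00") == 0           # zero
    assert parse_price("$1,234.56") == 123456  # thousands separator
    with pytest.raises(ValueError):
        parse_price("")                        # empty input
    with pytest.raises(ValueError):
        parse_price("$-3.50")                  # negative amount
```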
Maybe use one LLM to write the code, a wildly different one to write the tests, and yet another wildly different one to generate an English description of each test while doing critical review.
Quality increases if I double-check code with a second LLM (o4-mini in particular is great for that).
Or double-check tests the same way.
Maybe even write tests and code with different LLMs if that is your worry.
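A minimal sketch of that cross-model setup, assuming a generic complete() placeholder for whatever provider client you actually use (the model names are made up):

```python
def complete(model: str, prompt: str) -> str:
    """Placeholder: wire this up to a real LLM client of your choice."""
    raise NotImplementedError(f"no client configured for {model!r}")

SPEC = "Write a Python function that merges two sorted lists into one sorted list."

# 1. One model writes the implementation from the spec.
code = complete("model-a", f"Implement this spec:\n{SPEC}")

# 2. A different model writes the tests from the spec alone -- it never sees
#    the implementation, so the two can't share the same blind spot.
tests = complete("model-b", f"Write pytest tests for this spec:\n{SPEC}")

# 3. A third model describes each test in English for the human reviewer,
#    which is where the actual critical review happens.
review = complete(
    "model-c",
    "For each test below, explain in plain English what behaviour it pins down "
    f"and whether that matches the spec.\n\nSpec:\n{SPEC}\n\nTests:\n{tests}",
)
```

Generating the tests from the spec rather than from the code is the important bit; tests derived from buggy code tend to enshrine the bug.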
Code that doesn't do what you want isn't "working", bro.
Working exactly to spec is the code's only job.
Anyway, this is where AIs have been really bad for us, as well as sometimes "overengineering" their bug prevention in extremely inefficient ways. The flip side of this is, of course, that a lot of human programmers would make the same mistakes.
That sounds like a new opportunity for a startup that will collect hundreds of millions of dollars, brag about how their new AI prototype is so smart that it scares them, and deliver nothing.
What makes you say that? If LLMs didn't reason about things, they wouldn't be able to do one hundredth of what they do.
https://news.ycombinator.com/item?id=44163194
https://news.ycombinator.com/item?id=44068943
It doesn't optimize "good programs"; it optimizes humans' interpretation of good programs. More accurately, it optimizes what low-paid, overworked humans believe are good programs. Are you hiring your best and brightest to code-review the LLMs? Even if you do, it still optimizes tricking them. It will also optimize writing good programs, but you act like that's a well-defined and measurable thing.
You can definitely still run into some of the problems alluded to in the first link -- think hacking unit tests, deception, etc. -- but the bar is less "create a perfect RL environment" than "create an RL environment where solving the problem is easier than reward hacking." It might be possible to exploit a bug in the Lean 4 proof assistant to prove a mathematical statement, but I suspect it will usually be easier for an LLM to just write a correct proof (toy example at the end of this comment). Current RL environments aren't as watertight as Lean 4, but there's certainly work underway to make them more watertight.
This is in no way a "solved" problem, but I do see it as a counter to your assertion that "This isn't a thing RL can fix." RL is powerful.
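To make the Lean 4 point concrete, a toy example of the honest path:

```lean
-- Supplying the actual proof term is trivial here. "Reward hacking" this
-- environment would mean finding a soundness bug in Lean's kernel, which
-- is far harder than just writing the proof.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```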
If you can't code, the distinction is lost on you, but in fact the "correct" part is why programmers get paid. If "plausible" were good enough, the profession of programmer wouldn't exist.