If you look at the evolution of agent-written code, you see that it may start out fine, but as you add more and more features, things go horribly wrong. Let's say the model runs into a wall. Sometimes the right thing to do is go back into the architecture and put a door in that spot; other times the right thing to do is ask why you hit that wall in the first place, because maybe you've taken a wrong turn. The models seem to pick one or the other almost at random, and sometimes they just blast a hole through the wall. After enough features, it's clear there's no convergence, just like what happened in Anthropic's experiment. The agents ultimately can't fix one problem without breaking something else.
You can also see how they shoot themselves in the foot by adding layers upon layers of defensive coding that get so thick they themselves can't think through them. I once asked an agent to write a data structure that maintains an invariant in subroutine A and uses it in subroutine B. It wrote A fine, but B ignored the invariant and did a brute-force search over the data, the very thing the data structure was meant to avoid. As it was writing it, the agent explained that it didn't want to trust the invariant established in A because it might be buggy... Another thing you frequently see is that the code they write is so intent on success that it has a plan A, plan B, and plan C for everything. It tries to do something one way and adds contingencies for failure.
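To make the invariant example concrete, here is a minimal sketch of that kind of situation. It is not the actual code from that session; the class `SortedBag` and its method names are hypothetical, and I'm assuming the invariant is "the list is kept sorted". `insert` plays the role of subroutine A, `contains` is what B should look like, and `contains_defensive` is roughly what the agent produced instead.

```python
import bisect


class SortedBag:
    """Hypothetical container whose invariant is: self._items is always sorted."""

    def __init__(self):
        self._items = []

    def insert(self, value):
        # Subroutine A: establish the invariant by inserting at the sorted position.
        bisect.insort(self._items, value)

    def contains(self, value):
        # Subroutine B as intended: trust the invariant and binary-search, O(log n).
        i = bisect.bisect_left(self._items, value)
        return i < len(self._items) and self._items[i] == value

    def contains_defensive(self, value):
        # Subroutine B as the agent wrote it (assumed shape): distrust A and
        # scan every element, O(n), which defeats the point of keeping order.
        return any(item == value for item in self._items)


bag = SortedBag()
for v in [5, 1, 3]:
    bag.insert(v)
```

Both lookups return the same answers here, which is exactly why the defensive version looks harmless in review: it is correct, it just throws away the reason the structure exists.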
And so the code and the complexity compound until nothing and no one can save you. If you're lucky, your program is "finished" before that happens. My experience is mostly with gpt5.4 and 5.3-codex, although Anthropic's failed experiment shows that the Claude models suffer from similar problems. What does it say when a compiler expert that knows multiple compilers pretty much by heart, and has access to thousands of tests, can't even write a C compiler? Most important software is more complex than a C compiler, isn't as well specified, and isn't something the models have trained on.
I wish they could write working code; they just don't.[1] But man, can they debug (mostly because they're tenacious and tireless).
[1]: By which I don't mean they never do, but you really can't trust them to do it as you can a programmer. Knowing how to code, like knowing how to fly a plane, doesn't mean sometimes getting the right result. It means always getting the right result (within your capabilities, which in the case of humans are usually known in advance).