
0 points · pron · 1mo ago · 0 comments

My problem with the code the agents produce has nothing to do with style or art. The clearest example of how bad it is was shown by Anthropic's experiments where agents failed to write a C compiler, which is not a very hard programming job to begin with if you know compilers, as the models do, but they failed even with a practically unrealistic level of assistance (a complete spec, thousands of human-written tests, and a reference implementation used as an oracle, not to mention that the models were trained on both the spec and reference implementation).

If you look at the evolution of agent-written code you see that it may start out fine, but as you add more and more features, things go horribly wrong. Let's say the model runs into a wall. Sometimes the right thing to do is go back into the architecture and put a door in that spot; other times the right thing to do is ask why you hit that wall in the first place, maybe you've taken a wrong turn. The models seem to pick one or the other almost at random, and sometimes they just blast a hole through the wall. After enough features, it's clear there's no convergence, just like what happened in Anthropic's experiment. The agents ultimately can't fix one problem without breaking something else.

You can also see how they shoot themselves in the foot by adding layers upon layers of defensive coding that get so thick they themselves can't think through them. I once asked an agent to write a data structure that maintains an invariant in subroutine A and uses it in subroutine B. It wrote A fine, but B ignored the invariant and did a brute-force search over the data, the very thing the data structure was meant to avoid. As it was writing it, the agent explained that it didn't want to trust the invariant established in A because it might be buggy... Another thing you frequently see is that the code they write is so intent on success that it has a plan A, plan B, and plan C for everything. It tries to do something one way and adds contingencies for failure.
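To make the invariant example concrete, here is a minimal Python sketch. The actual data structure isn't shown in this thread, so everything here is an assumption for illustration: `SortedBag`, `add`, `contains`, and `contains_distrustful` are hypothetical names, and "the list stays sorted" stands in for whatever invariant was really involved.

```python
import bisect


class SortedBag:
    """Hypothetical container whose internal list is always kept sorted."""

    def __init__(self):
        self._items = []

    def add(self, value):
        # "Subroutine A": establishes the invariant by inserting in sorted order.
        bisect.insort(self._items, value)

    def contains(self, value):
        # "Subroutine B" as intended: relies on the invariant, so a binary
        # search (O(log n)) is enough.
        i = bisect.bisect_left(self._items, value)
        return i < len(self._items) and self._items[i] == value

    def contains_distrustful(self, value):
        # The pattern described above: ignores the invariant and scans every
        # item (O(n)), which is exactly what keeping the list sorted was
        # supposed to avoid.
        return any(item == value for item in self._items)


bag = SortedBag()
for v in [5, 1, 3]:
    bag.add(v)
print(bag.contains(3), bag.contains_distrustful(3))
```

Both lookups return the same answer; the difference is that the second one pays for a full scan on every call and gets nothing back from the work `add` did to keep the list sorted.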

And so the code and the complexity compound until nothing and no one can save you. If you're lucky, your program is "finished" before that happens. My experience is mostly with gpt5.4 and 5.3-codex, although Anthropic's failed experiment shows that the Claude models suffer from similar problems. What does it say when a compiler expert that knows multiple compilers pretty much by heart, with access to thousands of tests, can't even write a C compiler? Most important software is more complex than a C compiler, isn't as well specified, and the models haven't trained on it.

I wish they could write working code; they just don't.[1] But man, can they debug (mostly because they're tenacious and tireless).

[1]: By which I don't mean they never do, but you really can't trust them to do it as you can a programmer. Knowing to code, like knowing to fly a plane, doesn't mean sometimes getting the right result. It means always getting the right result (within your capabilities that are usually known in advance in the case of humans).


simianwords · 1mo ago

The thing is for most places the kind of code they write is good enough. You have painted an awfully pessimistic picture that frankly does not mirror reality of many enterprises.

> What does it say when a compiler expert that knows multiple compilers pretty much by heart, with access to thousands of tests, can't even write a C compiler?

It does not know compilers by heart. That's just not true. The point of the experiment was to see how big of a codebase it can handle without human intervention and now we know the limits. The limitation has always been context size.

> By which I don't mean they never do, but you really can't trust them to do it as you can a programmer. Knowing to code, like knowing to fly a plane, doesn't mean sometimes getting the right result. It means always getting the right result (within your capabilities that are usually known in advance in the case of humans).

Getting things right ~90% of the time still saves me a lot of time. In fact I would assume this is how autopilot also works in that it does 90% of a job and the pilot is required to supervise it.

pron (OP) · 1mo ago

> The thing is for most places the kind of code they write is good enough.

The kind of code they write is the kind of code that will be unsalvageable after 10-50 changes. That's throwaway code, although it looks good. I don't think that's good enough for most places.

Of course, if you really take the time to slowly and carefully review what they write (which many people say they do, but the results don't look like it) you can keep the agents on course with a lot of babysitting and a lot of "revert everything you did in this last iteration".

> You have painted an awfully pessimistic picture that frankly does not mirror reality of many enterprises.

Why pessimistic? The agents are truly remarkable at debugging, and they're very good at reviews. They just can't really code. Interestingly, if you ask codex to review other codex-written code it will often show you just how bad it is; it's just that if you loop coding and review, the agents don't converge.

> It does not know compilers by heart. That's just not true.

It is true. The models can reproduce large swathes of their training material with pretty good accuracy.

> The point of the experiment was to see how big of a codebase it can handle without human intervention and now we know the limits.

What they produced was 100KLOC, which is 5-10x larger than some production C compilers, but even 100KLOC isn't a big codebase. And the amount of human intervention in that experiment was huge: humans wrote specs, thousands of tests, a reference implementation and trained the model on all of those. In most software, at least two or three of these four efforts are not realistic.

What they didn't have is close and careful supervision of every coding iteration. If you really do that - i.e. carefully read every line of plausible-looking code and think about it - fine; if not, you're in for a nasty surprise when it's too late.

> The limitation has always been context size.

I don't buy it because human context size - especially in this case, where the model has been trained on everything - is smaller, and yet writing a C compiler isn't hard for a person to do.

> Getting things right ~90% of the time still saves me a lot of time.

They might get things right ~75% of the time when they write no more than a few hundred lines of code (unless we're talking a mechanical transformation). Anything beyond that is right closer to 10% of the time. The problem is that it works, at first, close to 90% of the time, but not in a way that will survive evolution for long. So if you're okay with code that works today but won't work a year from today, you might get away with it. I think some people are betting that the models a year from now will be able to fix the code written by today's models. Maybe they're right.

But the agents certainly save a lot of time on debugging and review. Coding - not so much, except in refactorings etc.
