The LLM has one job: to produce code that looks plausible. That's it. No logic has gone into writing that bit of code, so the bugs often won't be like the ones a programmer makes. Instead, LLMs can introduce a whole new class of bug that's way harder to debug.
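To illustrate (a made-up snippet, not the article's example): code like this skims as correct in review, but it's silently wrong at the edges, which is exactly the kind of bug that's hard to catch.

```python
def moving_average(values, window):
    """Return the moving averages of `values` over `window`-sized slices."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

print(moving_average([1, 2, 3, 4], 2))  # [1.5, 2.5, 3.5] -- looks fine
print(moving_average([1, 2], 5))        # [] -- silently wrong: no error, just an empty answer
```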
Funny story: when I first posted that and had a couple of thousand readers, I got many comments of the type "you should just read the code carefully on review", but _nobody_ pointed out that the opening example (the so-called "right code") had the exact same problem described in the article, proving exactly what you just said: it's hard to spot problems caused by plausibility machines.
AI-generated code will fuck up so many lives. The Post Office software in the UK did it without AI. I can't imagine how, and how many, lives will be ruined once some consultancy vibe-codes a government system. I might come to appreciate German bureaucracy and backwardness.
LLMs are way faster than me at writing tests. Just prompt for the kind of test you want.
I can and do use AI to help with test coverage, but coverage is pointless if you don't catch the interesting edge cases.
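For example (parse_price here is a made-up helper, just to show the contrast):

```python
import pytest

def parse_price(text: str) -> int:
    """Parse a price like "$1,234.56" into integer cents (hypothetical helper)."""
    if not text.startswith("$"):
        raise ValueError(f"not a price: {text!r}")
    cents = round(float(text[1:].replace(",", "")) * 100)
    if cents < 0:
        raise ValueError(f"negative price: {text!r}")
    return cents

# Coverage-padding test: exercises every line, catches nothing interesting.
def test_happy_path():
    assert parse_price("$3.50") == 350

# The interesting edge cases are where the bugs hide; prompt for these explicitly.
def test_edge_cases():
    assert parse_price("$0.00") == 0           # zero
    assert parse_price("$1,234.56") == 123456  # thousands separator
    with pytest.raises(ValueError):
        parse_price("")                        # empty input
    with pytest.raises(ValueError):
        parse_price("$-3.50")                  # negative amount
```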
Maybe use one LLM to write the code, a wildly different one to write the tests, and yet another wildly different one to generate an English description of each test while doing critical review.
Quality increases if I double-check code with a second LLM (o4-mini in particular is great for that).
Or double-check tests the same way.
Maybe even write tests and code with different LLMs if that is your worry.
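A minimal sketch of that cross-model setup, assuming a generic complete() placeholder for whatever provider client you actually use (the model names are made up):

```python
def complete(model: str, prompt: str) -> str:
    """Placeholder: wire this up to a real LLM client of your choice."""
    raise NotImplementedError(f"no client configured for {model!r}")

SPEC = "Write a Python function that merges two sorted lists into one sorted list."

# 1. One model writes the implementation from the spec.
code = complete("model-a", f"Implement this spec:\n{SPEC}")

# 2. A different model writes the tests from the spec alone -- it never sees
#    the implementation, so the two can't share the same blind spot.
tests = complete("model-b", f"Write pytest tests for this spec:\n{SPEC}")

# 3. A third model describes each test in English for the human reviewer,
#    which is where the actual critical review happens.
review = complete(
    "model-c",
    "For each test below, explain in plain English what behaviour it pins down "
    f"and whether that matches the spec.\n\nSpec:\n{SPEC}\n\nTests:\n{tests}",
)
```

Generating the tests from the spec rather than from the code is the important bit; tests derived from buggy code tend to enshrine the bug.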
Code that doesn't do what you want isn't "working", bro.
Working exactly to spec is the code's only job.
Anyway, this is where AIs have been really bad for us, as well as sometimes "overengineering" their bug prevention in extremely inefficient ways. The flip side of this is, of course, that a lot of human programmers would make the same mistakes.
That sounds like a new opportunity for a startup that will collect hundreds of millions of dollars, brag about how their new AI prototype is so smart that it scares them, and deliver nothing.
What makes you say that? If LLMs didn't reason about things, they wouldn't be able to do one hundredth of what they do.
https://news.ycombinator.com/item?id=44163194
https://news.ycombinator.com/item?id=44068943
It doesn't optimize "good programs"; it optimizes humans' interpretation of good programs. More accurately, it optimizes what low-paid, overworked humans believe are good programs. Are you hiring your best and brightest to code-review the LLMs? Even if you do, it still optimizes tricking them. It will also optimize writing good programs, but you act like that's a well-defined and measurable thing.
You can definitely still run into some of the problems alluded to in the first link -- think hacking unit tests, deception, etc. -- but the bar is less "create a perfect RL environment" than "create an RL environment where solving the problem is easier than reward hacking." It might be possible to exploit a bug in the Lean 4 proof assistant to prove a mathematical statement, but I suspect it will usually be easier for an LLM to just write a correct proof (toy example at the end of this comment). Current RL environments aren't as watertight as Lean 4, but there's certainly work underway to make them more watertight.
This is in no way a "solved" problem, but I do see it as a counter to your assertion that "This isn't a thing RL can fix." RL is powerful.
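To make the Lean 4 point concrete, a toy example of the honest path:

```lean
-- Supplying the actual proof term is trivial here. "Reward hacking" this
-- environment would mean finding a soundness bug in Lean's kernel, which
-- is far harder than just writing the proof.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```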
If you can't code, the distinction is lost on you, but in fact the "correct" part is why programmers get paid. If "plausible" were good enough, the profession of programmer wouldn't exist.