undefined | Better HN

blibble 10mo ago

> One thing that happened here is that they aren't using current LLMs

I've been hearing this for 2 years now

the previous model retroactively becomes total dogshit the moment a new one is released

convenient, isn't it?

nalllar 10mo ago

If you interact with internet comments and discussions as an amorphous blob of people you'll see a constant trickle of the view that models now are useful, and before were useless.

If you pay attention to who says it, you'll find that people have different personal thresholds for finding LLMs useful, not that any given person like steveklabnik above keeps flip-flopping on their view.

This is a variant on the goomba fallacy: https://englishinprogress.net/gen-z-slang/goomba-fallacy-exp...

steveklabnik 10mo ago

Sorry, that’s not my take. I didn’t think these tools were useful until the latest set of models, that is, they crossed the threshold of usefulness to me.

Even then though, “technology gets better over time” shouldn’t be surprising, as it’s pretty common.

mattmanser 10mo ago

Do you really see a massive jump?

For context, I've been using AI, a mix of OpenAI + Claude, mainly for bashing out quick React stuff, for over a year now. For anything else it's generally rubbish and slower than working without. Though I still use it to rubber duck, so I'm still seeing the level of quality for backend.

I'd say they're only marginally better today than they were even 2 years ago.

Every time a new model comes out you get a bunch of people raving how great the new one is and I honestly can't really tell the difference. The only real difference is reasoning models actually slowed everything down, but now I see its reasoning. It's only useful because I often spot it leaving out important stuff from the final answer.

simonw 10mo ago

The massive jump in the last six months is that the new set of "reasoning" models got really good at reasoning about when to call tools, and were accompanied by a flurry of tools-in-loop coding agents: Claude Code, OpenAI Codex, Cursor in Agent mode etc.

An LLM that can test the code it is writing and then iterate to fix the bugs turns out to be a huge step forward from LLMs that just write code without trying to then exercise it.
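As a rough illustration of that write, test, iterate loop, here is a minimal sketch. It is not how Claude Code, Codex, or Cursor are actually implemented; `fake_model`, `run_tests`, and `agent_loop` are hypothetical names, and the "model" is a hard-coded stand-in that returns a buggy answer first and a corrected one once it has seen the test failure.

```python
from typing import Callable, Optional, Tuple


def run_tests(source: str) -> Optional[str]:
    """Run the candidate code against one check; return an error message, or None on success."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # load the candidate definition
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def agent_loop(generate: Callable[[Optional[str]], str],
               max_attempts: int = 3) -> Tuple[str, int]:
    """Ask for code, test it, and feed any failure back until the tests pass."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        feedback = run_tests(candidate)
        if feedback is None:
            return candidate, attempt
    raise RuntimeError(f"still failing after {max_attempts} attempts: {feedback}")


def fake_model(feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM call (assumption: no real API is used here)."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"  # first draft has a bug
    return "def add(a, b):\n    return a + b\n"      # revised after seeing the failure


code, attempts = agent_loop(fake_model)
print(f"tests passed after {attempts} attempt(s)")
```

A model that only writes code would stop at the first draft; the loop is what lets the failing test drive the correction.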

vidarh 10mo ago

I've gone from asking the tools how to do things and cutting and pasting the bits (often small) that'd be helpful, via using assistants whose every decision I'd review and often having to start over, to now often starting an assistant with broad permissions and just reviewing the diff later, after it has made the changes pass the test suite, run a linter and fixed all the issues it brought up, and written a draft commit message.

The jump has been massive.

otabdeveloper4 10mo ago

> but now I see its reasoning

It's not showing its reasoning. "Reasoning" models are trained to output more tokens in the hope that more tokens means fewer hallucinations.

It's just a marketing trick and there is no evidence this sort of fake ""reasoning"" actually gives any benefit.

steveklabnik 10mo ago

Yes. In January I would have told you AI tools are bullshit. Today I’m on the $200/month Claude Max plan.

As with anything, your mileage may vary: I’m not here to tell anyone that thinks they still suck that their experience is invalid, but to me it’s been a pretty big swing.

hombre_fatal 10mo ago

I see a massive jump every time.

Just two years ago, this failed.

> Me: What language is this: "esto está escrito en inglés"

> LLM: English

Gemini and Opus have solved questions that took me weeks to solve myself. And I'll feed some complex code into each new iteration and it will catch a race condition I missed even with testing and line by line scrutiny.

Consider how many more years of experience a software engineer needs to catch hard race conditions just from reading code, compared with someone who couldn't do it after trying 100 times. We take it for granted already since we see it as "it caught it or it didn't", but these are massive jumps in capability.

ipaddr 10mo ago

Wait until the next set. You will find the previous ones weren't useful after all.

steveklabnik 10mo ago

This makes no sense to me. I’m well aware that I’m getting value today, that’s not going to change in the future: it’s already happened.

Sure they may get even more useful in the future but that doesn’t change my present.

bix6 10mo ago

Everything actually got better. Look at the image generation improvements as an easily visible benchmark.

I do not program for my day job and I vibe coded two different web projects. One in twenty mins as a test with cloudflare deployment having never used cloudflare and one in a week over vacation (and then fixed a deep safari bug two weeks later by hammering the LLM). These tools massively raise the capabilities for sub-average people like me and decrease the time / brain requirements significantly.

I had to make a little update to reset the KV store on cloudflare and the LLM did it in 20s after failing the syntax twice. I would’ve spent at least a few minutes looking it up otherwise.

mwigdahl 10mo ago

I've been a proponent for a long time, so I certainly fit this at least partially. However, the combination of Claude Code and the Claude 4 models has pushed the response to my demos of AI coding at my org from "hey, that's kind of cool" to "Wow, can you get me an API key please?"

It's been a very noticeable uptick in power, and although there have been some nice increases with past model releases, this has been both the largest and the one that has unlocked the most real value since I've been following the tech.

achierius 10mo ago

Is that really the case vs. 3.7? For me that was the threshold, and since then the improvements have been nice but not as significant.

mwigdahl 10mo ago

I would agree with you that the jump from Sonnet 3.7 to Sonnet 4 feels notable but not shocking. Opus 4 is considerably better, and Opus 4 combined with the Claude Code harness is what really unlocks the value for me.

cfst 10mo ago

The current batch of models, specifically Claude Sonnet and Opus 4, are the first I've used that have actually been more helpful than annoying on the large mixed-language codebases I work in. I suspect that dividing line differs greatly between developers and applications.

Aeolun 10mo ago

It’s true though? Previous models could do well in specifically created settings. You can throw practically everything at Opus, and it’ll work mostly fine.

simonw 10mo ago

The previous model retroactively becomes not as good as the best available models. I don't think that's a huge surprise.

cwillu 10mo ago

The surprise is the implication that the crossover between net-negative and net-positive impact happened to be in the last 4 months, in light of the initial release 2 years ago and sufficient public attention for a study to be funded and completed.

Yes, it might make a difference, but it is a little tiresome that there's always a “this is based on a model that is x months old!” comment, because it will always be true: an academic study does not get funded, executed, written up, and published in less time.

Ntrails 10mo ago

Some of it is just that (probably different) people said the same damn things 6 months ago.

"No, the 2.8 release is the first good one. It massively improves workflows"

Then, 6 months later, the study comes out.

"Ah man, 2.8 was useless, 3.0 really crossed the threshold on value add"

At some point, you roll your eyes and assume it is just snake oil sales

foobarqux 10mo ago

That's not the argument being made though, which is that it does "work" now and implying that actually it didn't quite work before; except that that is the same thing the same people say for every model release, including at the time of release of the previous one, which is now acknowledged to be seriously flawed; and including the future one, at which time the current models will similarly be acknowledged to be not only less performant than the future models, but inherently flawed.

Of course it's possible that at some point you get to a model that really works, irrespective of the history of false claims from the zealots, but it does mean you should take their comments with a grain of salt.

steveklabnik 10mo ago

> That's not the argument being made though, which is that it does "work" now and implying that actually it didn't quite work before

Right.

> except that that is the same thing the same people say for every model release,

I did not say that, no.

I am sure you can find someone who is in a Groundhog Day about this, but it’s just simpler than that: as tools improve, more people find them useful than before. You’re not talking to the same people, you are talking to new people each time who now have had their threshold crossed.
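A toy model makes the threshold argument concrete. This is only a sketch under made-up assumptions: the names, the per-person thresholds, and the release quality scores below are all hypothetical, not data from the thread or the study.

```python
from typing import Dict, List

# Hypothetical people and the minimum model quality each one needs before
# they find the tools useful (assumed values, for illustration only).
thresholds: Dict[str, int] = {"alice": 2, "bob": 3, "carol": 4}

# Hypothetical quality scores for four successive model releases.
releases: List[int] = [1, 2, 3, 4]


def newly_convinced(thresholds: Dict[str, int],
                    releases: List[int]) -> Dict[int, List[str]]:
    """For each release, list the people whose threshold was crossed by that release."""
    result: Dict[int, List[str]] = {}
    previous = float("-inf")
    for quality in releases:
        result[quality] = sorted(
            name for name, needed in thresholds.items()
            if previous < needed <= quality
        )
        previous = quality
    return result


for quality, people in newly_convinced(thresholds, releases).items():
    print(f"release {quality}: first useful for {people}")
```

In this toy setup every release from 2 onward produces someone saying "this is the first one that works", yet no individual ever changes their mind twice.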

pdabbadabba 10mo ago

Maybe it's convenient. But isn't it also just a fact that some of the models available today are better than the ones available five months ago?

bryanrasmussen 10mo ago

Sure, but after having spent some time trying to get anything useful, programmatically, out of previous models and not getting anything, how much time should one spend once a new one is announced?

Sure, you may end up missing out on a good thing and then having to come late to the party, but coming early to the party too many times, only to find the beer watered down and the food full of grubs, is apt to make you cynical the next time a party announcement comes your way.

Terr_ 10mo ago

Plus it's not even possible to miss the metaphorical party: If it gets going, it will be quite obvious long before it peaks.

(Unless one believes the most grandiose prophecies of a technological-singularity apocalypse, that is.)

Terr_ 10mo ago

That's not the issue. Their complaint is that proponents keep revising what ought to be fixed goalposts... Well, fixed unless you believe unassisted human developers are also getting dramatically better at their jobs every year.

Like the boy who cried wolf, it'll eventually be true with enough time... But we should stop giving them the benefit of the doubt.

_____

Jan 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."

Feb 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."

Mar 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."

Apr 2025: [Ad nauseam, you get the idea]

pdabbadabba 10mo ago

Fair enough. For what it's worth, I've always thought that the more reasonable claim is that AI tools make poor-average developers more productive, not necessarily expert developers.

itsoktocry 10mo ago

>the previous model retroactively becomes total dogshit the moment a new one is released

Keep writing your code manually, nobody cares.

player1234 10mo ago

And nobody will notice.

jstummbillig 10mo ago

Convenient for whom and what...? There is nothing tangible to gain from you believing or not believing that someone else does (or does not) get a productivity boost from AI. This is not a religion and it's not crypto. The AI user's net worth is not tied to another one's use of or stance on AI (if anything, it's the opposite).

More generally, the phenomenon is quite simply explained and nothing surprising: New things improve, quickly. That does not mean that something is good or valuable but it's how new tech gets introduced every single time, and readily explains changing sentiment.

leshow 10mo ago

I think you're missing the broader context. There are a lot of people very invested in the maximalist outcome, which does create pressure for people to be boosters. You don't need a digital token for that to happen. There's a social media aspect as well that creates a feedback loop about claims.

We're in a hype cycle, and it means we should be extra critical when evaluating the tech so we don't get taken in by exaggerated claims.

jstummbillig 10mo ago

I mostly don't agree. Yes, there is always social pressure with these things, and we are in a hype cycle, but the people "buying in" are simply not doing much at all. They are mostly consumers, waiting for the next model, which they have no control over or stake in creating (by and large).

The people not buying into the hype, on the other hand, are actually the ones that have a very good reason to be invested, because if they turn out to be wrong they might face some very uncomfortable adjustments in the job landscape and a lot of the skills that they worked so hard to gain and believed to be valuable.

As always, be wary of any claims, but the tension here is very much the reverse of crypto and I don't think that's very appreciated.

card_zero 10mo ago

I saw that edit. Indeed you can't predict that rejecting a new thing is part of a routine of being wrong. It's true that "it's strange and new, therefore I hate it" is a very human (and adorable) instinct, but sometimes it's reasonable.

saturneria 10mo ago

It is an even more human reaction when the new strange thing directly threatens to upend and massively change the industry that puts food on your table.

The steam-powered loom was not good for the luddites either. Good for society at large in the long term, but all the negative points that a 40-year-old knitter in 1810 could make against the steam-powered loom would have been perfectly reasonable and accurate judged from that individual's perspective.

jstummbillig 10mo ago

"I saw that edit" lol

grey-area 10mo ago

Honestly the hype cycle feels very like crypto, and just like crypto, prominent VCs have a lot of money riding on the outcome.

jstummbillig 10mo ago

Of course, lots of hype, but my point is that the reason why is very different and it matters: As an early bc adopter, making you believe in bc is super important to my net worth (and you not believing in bc makes me look like an idiot and lose a lot of money).

In contrast, what do I care if you believe in code generation AI? If you do, you are probably driving up pricing. I mean, I am sure that there are people that care very much, but there is little inherent value for me in you doing so, as long as the people who are building the AI are making enough profit to keep it running.

With regards to the VCs, well, how many VCs are there in the world? How many of the people who have something good to say about AI are likely VCs? I might be off by an order of magnitude, but even then it would really not be driving the discussion.

steveklabnik 10mo ago

I agree with you, and I think that’s coloring a lot of people’s perceptions. I am not a crypto fan but am an LLM fan.

Every hype cycle feels like this, and some of them are nonsense and some of them are real. We’ll see.
