One thing that happened here is that they aren't using current LLMs:
> Most issues were completed in February and March 2025, before models like Claude 4 Opus or Gemini 2.5 Pro were released.
That doesn't mean this study is bad! In fact, I'd be very curious to see it done again, but with newer models, to see if that has an impact.
I've been hearing this for 2 years now
the previous model retroactively becomes total dogshit the moment a new one is released
convenient, isn't it?
If you pay attention to who says it, you'll find that people have different personal thresholds for finding llms useful, not that any given person like steveklabnik above keeps flip-flopping on their view.
This is a variant on the goomba fallacy: https://englishinprogress.net/gen-z-slang/goomba-fallacy-exp...
Even then though, “technology gets better over time” shouldn’t be surprising, as it’s pretty common.
For context, I've been using AI, a mix of OpenAi + Claude, mainly for bashing out quick React stuff. For over a year now. Anything else it's generally rubbish and slower than working without. Though I still use it to rubber duck, so I'm still seeing the level of quality for backend.
I'd say they're only marginally better today than they were even 2 years ago.
Every time a new model comes out you get a bunch of people raving how great the new one is and I honestly can't really tell the difference. The only real difference is reasoning models actually slowed everything down, but now I see its reasoning. It's only useful because I often spot it leaving out important stuff from the final answer.
I do not program for my day job and I vibe coded two different web projects. One in twenty mins as a test with cloudflare deployment having never used cloudflare and one in a week over vacation (and then fixed a deep safari bug two weeks later by hammering the LLM). These tools massively raise the capabilities for sub-average people like me and decrease the time / brain requirements significantly.
I had to make a little update to reset the KV store on cloudflare and the LLM did it in 20s after failing the syntax twice. I would’ve spent at least a few minutes looking it up otherwise.
It's been a very noticeable uptick in power, and although there have been some nice increases with past model releases, this has been both the largest and the one that has unlocked the most real value since I've been following the tech.
Yes, it might make a difference, but it is a little tiresome that there's always a “this is based on a model that is x months old!” comment, because it will always be true: an academic study does not get funded, executed, written up, and published in less time.
Of course it's possible that at some point you get to a model that really works, irrespective of the history of false claims from the zealots, but it does mean you should take their comments with a grain of salt.
Sure you may end up missing out on a good thing and then having to come late to the party, but coming early to the party too many times and the beer is watered down and the food has grubs is apt to make you cynical the next time a party announcement comes your way.
Like the boy who cried wolf, it'll eventually be true with enough time... But we should stop giving them the benefit of the doubt.
_____
Jan 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Feb 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Mar 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Apr 2025: [Ad nauseam, you get the idea]
Keep writing your code manually, nobody cares.
More generally, the phenomenon this is quite simply explained and nothing surprising: New things improve, quickly. That does not mean that something is good or valuable but it's how new tech gets introduced every single time, and readily explains changing sentiment.
We're in a hype cycle, and it means we should be extra critical when evaluating the tech so we don't get taken in by exaggerated claims.