So a cutoff point of August 2025, just before that, is a bit unfortunate (I'm sure there'll be newer studies soon).
I've recently given it a go myself, and it certainly doesn't get it right all the time. But I was able to generate AI-assisted code that met my quality standards at roughly the same speed as coding it by hand.
Since then the progress has been incremental. I would say the big win is that models degrade more slowly as context grows. This means, especially for heavily vibecoded-from-scratch projects, that you hit the "I don't even know wtf this is anymore" wall way later, maybe never if you're steering things properly.
I think people see this as radically different because you can avoid hitting that wall for longer. Whether that's true is debatable. But in terms of what the model actually does, how it responds to prompts, I genuinely think it is only marginally better. And again, I think the benchmarks confirm this; I quite like Fodor's analysis on benchmarking here[0].
I use these models daily and try out new ones as they appear. When people switch to a new model, I think they over-interpret "the model did something different" or "it got this one right" as "this is radically better", which I believe is simply a result of cognitive bias / poor measurement.
[0] https://jamesfodor.com/2025/06/22/line-goes-up-large-languag...
Even now I notice on a daily basis that it can easily lead to bloat and unnecessary complexity. We'll see whether that can be fixed by even stronger models.