I don't have personal experience, but there seems to be a broad consensus that Opus 4.5 was the tipping point between "kinda bad" and "actually kinda useful".
So a cutoff point of August 2025, just before that, is a bit unfortunate (I'm sure there'll be newer studies soon).
That reaction has happened with every model release for the past few years. Maybe they aren’t the same people, but it’s always “old model was terrible, new model gets it right” then “new model was terrible, newer model gets it right,” ad infinitum.
A large proportion of my professional network were in the "AI for code generation might just be a fad" camp pre-Opus 4.5 (and the Codex/Gemini models that came out shortly after it), and now almost everyone seems to think that AI will have at least some place in professional development environments on an ongoing basis.
I've recently given it a go myself, and it certainly doesn't get it right all the time. But I was able to generate AI-assisted code that met my quality standards at roughly the same speed as coding it by hand.
FWIW I am definitely someone who uses AI. I have been using it for a few years now. There's no question that models have improved. I'd say the biggest leap was around the ChatGPT 3.5 -> 4.0 transition, which radically reduced hallucination problems. The big issue of "it just made up a module that doesn't exist" more or less went away at that point. That was the jump from "spits out text that might help you" to "can produce value".
Since then it has been incremental. I would say the big win has been that models degrade more slowly as context grows. This means, especially for heavily vibecoded-from-scratch projects, that you hit the "I don't even know wtf this is anymore" wall way later, maybe never if you're steering things properly.
I think because you can avoid hitting that wall for longer, people see this as radically different. It's debatable whether that's true or not. But in terms of just what the model does, like how it responds to prompts, I genuinely think it is only marginally better. And again, I think benchmarks confirm this, and I quite like Fodor's analysis on benchmarking here[0].
I use these models daily and I try new models out. I think that when people switch to a new model, they overinterpret "model did something different" or "it got it right" moments as "this is radically better", which I believe is simply a result of cognitive bias / poor measurement.
I have experience and the gap is exaggerated, imo. edit: And I think benchmarks largely support this, and benchmarks are already biased to overstate LLM performance anyway.
I used Claude Code before August 2025 and it was definitely usable, although it's clearly more capable now. All in all, the difference is noticeable but not a completely different world, in my eyes.
I notice on a daily basis, even now, that it can easily lead to bloat and unnecessary complexity. We will see whether that can be fixed by even stronger models or not.