Evidently, all these models still fall short.
Modern robots would struggle to fold socks and put them in a drawer, but they're great at making cars.
The 2-4-6 game comes to mind: people keep proposing triples that confirm their guessed rule instead of ones that could refute it. They may well have verified that the AI works, but it's hard to learn the skill of thinking about how to falsify a belief.
I'm talking about regular people, who actually use these tools productively and can tell the models are pulling off tasks that were previously unachievable.
And yet... every interface to every LLM has a "ChatGPT can make mistakes. Check important info." style disclaimer.
The hype around this stuff may be deafening, but it's often not entirely the direct fault of the model vendors themselves, who even put out lengthy papers describing their models' many flaws.
Humans fuck up all the time.