It's poor performance on benchmarks drives my skepticism of LLM benchmarking in general. I trust my feel for the models much more, and my feel was that 0314 was great.
The one thing that 0314 doesn't do well are the tricks like structured output and tool calling which makes it a less useful agentic type of tool, but from a pure thinking perspective, I think it's the best.