The article says the ensemble of Kaggle solutions (aggregated in some unexplained way) achieves 81%. This is better than their average Mechanical Turk worker, but worse than their average STEM grad. It's better than tuned o3 with low compute, worse than tuned o3 with high compute.
There's also a point on the figure marked "Kaggle SOTA", around 60%. I can't find any explanation for that, but I guess it's the best individual Kaggle solution.
The Kaggle solutions would probably score higher with more compute, but nobody has any incentive to spend >$1M on approaches that obviously don't generalize. OpenAI, by contrast, did have an incentive to spend that much tuning and testing o3, since it's possible the approach will generalize to a practically useful domain (though that hasn't been demonstrated yet). Even if it ultimately doesn't, they're getting spectacular publicity now from that promise.