The reduction in hallucinations is potentially the biggest upgrade. If it really cuts hallucinations by 75% or more relative to o3 and GPT-4o, as the graphs claim, it will be a giant step forward. The inability to trust answers given by AI is the single biggest hurdle for many applications.
Agreed, this is possibly the biggest takeaway for me. If true, it will make a real difference in user experience, and hallucination benchmarks like these could become the next major target.