85% is just the (semi-arbitrary) threshold for winning the prize.
o3 actually beats the human average by a wide margin on the Public Eval set: 64.2% for humans vs. 82.8%+ for o3.
...
Here's the full breakdown by dataset, since none of the articles make it clear --
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
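As a rough sanity check on the "wide margin" claim, the Public Eval numbers can be compared like-for-like. This is only a minimal Python sketch: the dictionary and function names are made up for illustration, and the figures are copied straight from the list above.

```python
# Public Eval scores (percent), copied from the breakdown above.
public_eval = {
    "o3 (unlimited compute)": 91.5,
    "o3 (limited compute)": 82.8,
    "human average (Mechanical Turk)": 64.2,
}

human = public_eval["human average (Mechanical Turk)"]


def margin_over_human(score):
    """Percentage-point gap between a score and the human average."""
    return round(score - human, 1)


# Gap for each o3 configuration, measured on the same dataset as the humans.
margins = {
    name: margin_over_human(score)
    for name, score in public_eval.items()
    if name.startswith("o3")
}

for name, gap in margins.items():
    print(f"{name}: +{gap} points over the human average")
```

That gives +27.3 points for unlimited compute and +18.6 points for limited compute, both on the same dataset as the human figure.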
...
References:
[1] https://arcprize.org/guide