I think it's still an interesting way to measure general intelligence; it's just that o3 has demonstrated you can actually reach human performance on it by training on the public training set and throwing ridiculous amounts of compute at it, which I imagine equates to ludicrously long chains-of-thought and, if I understand correctly, more than one chain-of-thought per task (the blog post mentions sample sizes, with o3-low using 6 and o3-high using 1024; not sure if those are chains-of-thought per task or something else).
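To make that concrete, here's a rough sketch of what "sample size" might mean if it's independent attempts per task aggregated by majority vote. This is an assumption on my part; the blog post doesn't spell out the aggregation method, and `solve_once` is just a stand-in for one chain-of-thought attempt so the example runs.

```python
from collections import Counter
import random

def solve_once(task, rng):
    # Stand-in for a single chain-of-thought attempt; returns a
    # noisy guess so the sketch is runnable. Assume the real model
    # gets the right answer ("A") more often than any wrong one.
    return rng.choice(["A", "A", "A", "B", "C"])

def solve_with_voting(task, n_samples, seed=0):
    # Draw n independent samples for the task, then take the most
    # common answer. More samples = more compute per task.
    rng = random.Random(seed)
    answers = [solve_once(task, rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(solve_with_voting("demo-task", 6))     # low-compute setting
print(solve_with_voting("demo-task", 1024))  # high-compute setting
```

The point being: with 1024 samples the majority vote almost always lands on the most probable answer, so pouring compute in buys accuracy without the single-attempt capability changing at all.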
Once you look at it that way, the approach really doesn't look like intelligence that generalizes to novel domains. It doesn't pass the sniff test. It looks a lot more like brute-forcing.
Which is probably why, in order to actually qualify for the leaderboard, they stipulate that you can't use more than $10k of compute.