I think it's still an interesting way to measure general intelligence; it's just that o3 has demonstrated you can actually reach human performance on it by training on the public training set and throwing ridiculous amounts of compute at it, which I imagine equates to ludicrously long chains-of-thought and, if I understand correctly, more than one chain-of-thought per task (the blog post mentions sample sizes, with o3-low using 6 and o3-high using 1024; not sure if those are chains-of-thought per task or something else).
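To make that concrete, here's a rough sketch of what "sample size" might mean if it's independent attempts per task aggregated by majority vote. This is an assumption on my part; the blog post doesn't spell out the aggregation method, and `solve_once` is just a stand-in for one chain-of-thought attempt so the example runs.

```python
from collections import Counter
import random

def solve_once(task, rng):
    # Stand-in for a single chain-of-thought attempt; returns a
    # noisy guess so the sketch is runnable. Assume the real model
    # gets the right answer ("A") more often than any wrong one.
    return rng.choice(["A", "A", "A", "B", "C"])

def solve_with_voting(task, n_samples, seed=0):
    # Draw n independent samples for the task, then take the most
    # common answer. More samples = more compute per task.
    rng = random.Random(seed)
    answers = [solve_once(task, rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(solve_with_voting("demo-task", 6))     # low-compute setting
print(solve_with_voting("demo-task", 1024))  # high-compute setting
```

The point being: with 1024 samples the majority vote almost always lands on the most probable answer, so pouring compute in buys accuracy without the single-attempt capability changing at all.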
Once you look at it that way, the approach really doesn't look like intelligence that generalizes to novel domains. It doesn't pass the sniff test. It looks a lot more like brute-forcing.
Which is probably why, in order to actually qualify for the leaderboard, they stipulate that you can't use more than $10k of compute.