I think the actual guessing space for these free-response problems is much smaller than it looks, because simple priors over the question already rule out most answers. For example:
“Richard, Jerry, and Robert are going to share 60 cherries. If Robert has 30 cherries, and has 10 more than Richard, how many more cherries does Robert have than Jerry?”
A rudimentary model will likely already know the answer is between 0 and 60.
Knowing that the answer only involves adding and subtracting the numbers in the problem narrows it down to maybe 8 candidates.
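A rough sketch of that narrowing, assuming each number stated in the problem (60, 30, 10) is added, subtracted, or left out at most once, and that the answer must land between 0 and 60. The helper name `candidate_answers` and the "at most once" rule are illustrative assumptions, not a claim about how any real model reasons:

```python
from itertools import product


def candidate_answers(numbers, low, high):
    """Return the sorted values reachable by adding or subtracting
    each number at most once, restricted to the range [low, high]."""
    candidates = set()
    # Each number gets a coefficient: -1 (subtract), 0 (skip), +1 (add).
    for signs in product((-1, 0, 1), repeat=len(numbers)):
        total = sum(sign * n for sign, n in zip(signs, numbers))
        if low <= total <= high:
            candidates.add(total)
    return sorted(candidates)


numbers = [60, 30, 10]  # quantities stated in the cherry problem
candidates = candidate_answers(numbers, low=0, high=60)
print(candidates)
print(f"{len(candidates)} candidates, blind-guess chance about {1 / len(candidates):.0%}")
```

Under these assumptions only the multiples of 10 from 0 to 60 survive, which is 7 candidates and in the same ballpark as the estimate of 8. The correct answer, 20, is among them.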
While SAT problems have only 4 answer choices, there’s usually one trick/trap answer, which I think might be difficult for a model to avoid guessing by accident. The analogy I can think of is that it’s sometimes better to cover up the answer choices first and work out a solution, so you don’t get biased by any particular choice.