There seems to be a maximum amount of reasoning LLMs can do per token (per unit of computation). If you prompt one to spend more tokens before it outputs the final answer ("think step by step", "check your answer", ...) it becomes smarter. People have lucked into different prompting strategies that elicit this, but there are probably more.
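As a minimal sketch of the idea, here are two ways of prompting the same question: one that forces an immediate answer, and one that invites the model to spend tokens reasoning first. The `ask` function is a hypothetical placeholder for whatever LLM client you use, not a real API.

```python
# Sketch: a direct prompt vs. a "spend more tokens first" prompt.
# `ask` is a hypothetical stand-in for a real LLM API call.

QUESTION = (
    "A bat and a ball cost $1.10 together. The bat costs $1.00 "
    "more than the ball. How much does the ball cost?"
)

def direct_prompt(question: str) -> str:
    # Leaves the model almost no tokens to reason in before answering.
    return f"{question}\nAnswer with just the number."

def cot_prompt(question: str) -> str:
    # Gives the model room to spend many tokens before committing.
    return (
        f"{question}\n"
        "Think step by step, check your answer, "
        "and only then state the final answer."
    )

def ask(prompt: str) -> str:
    # Placeholder: plug in your own LLM client here.
    raise NotImplementedError

if __name__ == "__main__":
    print(direct_prompt(QUESTION))
    print(cot_prompt(QUESTION))
```

The only difference between the two calls is the instruction appended to the question; any gap in answer quality then reflects how many tokens of reasoning the prompt allowed.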
Ultimately, I feel it is fairer to benchmark LLMs by what they can be prompted into doing. After all, we let people carefully work through a problem during exams, so it seems fair to hold LLMs to the same standard.