Anthropic p-hacking the benchmark strikes me as cheating, and somewhat unlikely. Mythos figuring out how to cheat at the benchmark strikes me as much more likely.
But if that hypothesis is the explanation the interesting part is Opus 4.7 (but not 4.6) seems to be doing the same.