I won't be as brutal in my wording, but I agree with the sentiment. This was something drilled into me as someone whose hobbies are PC gaming and photography: benchmarks, while handy measures of potential capabilities, are not guarantees of real-world performance. Very few PC gamers completely reinstall the OS before benchmarking to remove all potential cruft or performance impacts, just as very few photographers exclusively take photos of test materials.
While I appreciate the benchmark and its goals (not to mention the puzzles - I quite enjoy figuring them out), passing this benchmark does not demonstrate or guarantee real-world capabilities or performance. This is why I increasingly side-eye this field and its obsession with passing a benchmark and then moving the goalposts to a newer, harder one that claims to be a better simulation of human capabilities than the last: it reeks of squandered capital and the lack of a viable, profitable product, at least to my sniff test. Rather than simply capitalize on their actual accomplishments (which LLMs are - natural language interaction is huge!), they're trying to prove to Capital that with a few (hundred) billion more in investment, they can make AGI out of this and replace all those expensive humans.
They've built the most advanced prediction engines ever conceived, and insist they're best used to replace labor. I'm not sure how they reached that conclusion, but considering that even their own models refute this use case for LLMs, I doubt their ability to execute on that lofty promise.