I think a similar claim could be levelled against other benchmarks or LLM evaluation tasks. One could say that the Turing test was designed to assess human intelligence, and LLMs pass it, therefore LLMs have human intelligence. This is generally considered false now, because we can plainly see that LLMs do not have intelligence in the same way humans do (yet? debatable, not the point). Instead, we concluded that the Turing test was not the right benchmark. That's not to diminish its importance: it was hugely important as a part of AI education, and possibly even AI development, for decades.
ARC does seem to be pushing the boundaries; I'm just not convinced that it's testing a provable step change.