"The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning."
Not sure I understand how this follows. The fact that a certain type of model does well on a certain benchmark means that the benchmark is relevant for a real-world reasoning? That doesn't make sense.
It shows objectively that the models are getting better at some form of reasoning, which is at least worth noting. Whether that improved reasoning is relevant for the real world is a different question.