No, they haven't. These results do not generalize, as mentioned in the article:
"Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute"
Meaning they haven't solved AGI, and the task itself does not represent programming well; these models do not perform that well on engineering benchmarks.