You are either underselling or not understanding the breakthrough. They trained a 600B-parameter model on 15T tokens for <$6M. Regardless of the provenance of the tokens, this in itself is impressive.
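Not to mention post-training. Their novel GRPO technique, used for preference optimization / alignment, is also much more efficient than PPO: it drops PPO's separate value (critic) model and instead scores each sampled response against the other responses drawn for the same prompt.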
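To make the efficiency point concrete, here is a rough sketch of the group-relative advantage that gives GRPO its name. This is my own illustration, not their code: the function name, the `eps` term, and the use of population standard deviation are assumptions. The idea is that each response's reward is normalized against the mean and spread of its own group, so no learned value network is needed to provide a baseline.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper for illustration. `eps` (an assumption) guards
    against division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers end up with a positive advantage and incorrect ones with a negative advantage, and those values then weight the policy-gradient update. With PPO you would instead train and run a critic of roughly the same size as the policy to get that baseline, which is where much of the extra memory and compute goes.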