We don't know exactly how the "thinking" models are trained at the big 3, but we know the open-source ones have been trained with RL. There's no human in that loop: they're aligned against rewards, and that process is automated.
> Which "coding LLMs" are you referring to that are trained purely on verifiably correct synthetic data?
The "thinking" ones (i.e. oN series, claudeThinking, gemini2.5 pro) and their open-source equivalents - qwq, R1, qwen3, some nemotrons, etc.
From the DeepSeek paper on R1 we know the model was trained with GRPO, which is a form of RL (reinforcement learning). QwQ and the rest were likely trained in a similar way. (Before GRPO, another popular method was PPO. I've also seen work on unsupervised DPO, where the preference pairs are generated by having a model produce n rollouts, verifying them (e.g. running tests), and using the results to guide pair creation.)
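The unsupervised-DPO idea above can be sketched in a few lines. This is a toy illustration, not any specific paper's pipeline: the rollouts are hard-coded stand-ins for n samples from a policy model, and the unit tests inside `verify` play the role of the automated reward signal.

```python
import itertools

# Hypothetical rollouts for the prompt "write add(a, b)".
# In practice these would be n samples drawn from the policy model.
rollouts = [
    "def add(a, b):\n    return a + b",      # correct
    "def add(a, b):\n    return a - b",      # wrong
    "def add(a, b):\n    return a + b + 1",  # wrong
]

def verify(code: str) -> bool:
    """Run the candidate against unit tests; passing/failing is the label."""
    ns = {}
    try:
        exec(code, ns)
        assert ns["add"](2, 3) == 5
        assert ns["add"](-1, 1) == 0
        return True
    except Exception:
        return False

# Label each rollout automatically, then pair every passing rollout
# (chosen) with every failing one (rejected) to get DPO training pairs.
passed = [r for r in rollouts if verify(r)]
failed = [r for r in rollouts if not verify(r)]
pairs = list(itertools.product(passed, failed))

print(len(pairs))  # 1 passing x 2 failing -> 2 pairs
```

No human labels anywhere: the tests decide which rollout is "chosen", which is what makes the whole loop automatable for code.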