We don't know exactly how the "thinking" models are trained at the big 3, but we know the open-source ones have been trained with RL. There's no human in that loop: they're aligned against rewards, and that process is automated.
> Which "coding LLMs" are you referring to that are trained purely on verifiably correct synthetic data?
The "thinking" ones (i.e. oN series, claudeThinking, gemini2.5 pro) and their open-source equivalents - qwq, R1, qwen3, some nemotrons, etc.
From the DeepSeek paper on R1 we know the model was trained with GRPO, which is a form of RL (reinforcement learning). QwQ and the rest were likely trained in a similar way. (Before GRPO, another popular method was PPO. I've also seen work on unsupervised DPO, where the preference pairs are generated by having a model produce n rollouts, verifying them (e.g. running tests), and using the results to guide pair creation.)
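The unsupervised-DPO idea above can be sketched in a few lines. This is a toy illustration, not any specific paper's pipeline: the rollouts are hard-coded stand-ins for n samples from a policy model, and the unit tests inside `verify` play the role of the automated reward signal.

```python
import itertools

# Hypothetical rollouts for the prompt "write add(a, b)".
# In practice these would be n samples drawn from the policy model.
rollouts = [
    "def add(a, b):\n    return a + b",      # correct
    "def add(a, b):\n    return a - b",      # wrong
    "def add(a, b):\n    return a + b + 1",  # wrong
]

def verify(code: str) -> bool:
    """Run the candidate against unit tests; passing/failing is the label."""
    ns = {}
    try:
        exec(code, ns)
        assert ns["add"](2, 3) == 5
        assert ns["add"](-1, 1) == 0
        return True
    except Exception:
        return False

# Label each rollout automatically, then pair every passing rollout
# (chosen) with every failing one (rejected) to get DPO training pairs.
passed = [r for r in rollouts if verify(r)]
failed = [r for r in rollouts if not verify(r)]
pairs = list(itertools.product(passed, failed))

print(len(pairs))  # 1 passing x 2 failing -> 2 pairs
```

No human labels anywhere: the tests decide which rollout is "chosen", which is what makes the whole loop automatable for code.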