undefined | Better HN

0 pointspatelajay2852y ago0 comments

Yep, DPO is not technically “RL” and implicitly uses the LLM itself as a reward model, but training with DPO is far more stable for that reason.

0 comments

espadrine2y ago

DPO is as close to RL as RLHF. The latter also uses the LLM as a reward model.

I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.

Still, what the code does isn't what is described in the paper that the page links to.

nextaccountic2y ago

> I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.

Isn't this just because reinforcement learning and supervised learning are both optimization problems?

espadrine2y ago

In part, yes! But also because what used to define it was the human-curated datasets: SL contained input/output pairs, while RL contained episodes with sporadic rewards.

Nowadays, many datasets have different forms or are synthetic. DPO uses datasets with both positive and negative examples (instead of just a target output as with traditional SL); RLHF uses synthetic rewards.

patelajay285OP2y ago

I tend to agree @espadrine, it's semantics for the most part

j / k navigate · click thread line to collapse

0 comments

espadrine2y ago

DPO is as close to RL as RLHF. The latter also uses the LLM as a reward model.

I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.

Still, what the code does isn't what is described in the paper that the page links to.

nextaccountic2y ago

> I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.

Isn't this just because reinforcement learning and supervised learning are both optimization problems?

espadrine2y ago

In part, yes! But also because what used to define it was the human-curated datasets: SL contained input/output pairs, while RL contained episodes with sporadic rewards.

patelajay285OP2y ago

I tend to agree @espadrine, it's semantics for the most part

j / k navigate · click thread line to collapse