> But reward models are always curated by humans.
There is no inherent reason why they need to be.
> So yeah theoretically you could generate reward models with LLMs, but they won't be any good, unless they are curated by other reward models that are ultimately curated by humans.
This reasoning begs the question: the premise (that LLM-generated reward models won't be any good without human curation) holds only if the conclusion is already assumed. It is therefore circular and logically invalid as an argument.
There is no inherent reason why this needs to be the case.