> Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. [...] Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise.
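For anyone skimming, the binary-vs-continuous distinction in the quote can be sketched roughly like this (the `fact_checks` representation and both function names are hypothetical, just for illustration):

```python
def binary_reward(fact_checks):
    # fact_checks: list of booleans, one per factual claim in the output.
    # Binary scheme from the quote: reward 1 only if *every* claim is
    # correct, 0 otherwise.
    return 1 if all(fact_checks) else 0

def continuous_reward(fact_checks):
    # Contrast: a continuous scheme gives partial credit proportional
    # to the fraction of correct claims.
    if not fact_checks:
        return 1.0  # no factual claims to verify (hypothetical convention)
    return sum(fact_checks) / len(fact_checks)
```

So an output with nine correct claims and one wrong one scores 0.9 under the continuous scheme but 0 under the binary one, which is the whole point of the paper's reward design.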
Someone correct me if I'm wrong, as I'm on the very edge of this space looking in, but does this mean they are using a "degraded performance, fewer hallucinations" model to fact-check the "more powerful but hallucination-prone" model?