The InstructGPT paper also showed that RLHF made hallucination worse (though with more user data rejecting common hallucinations, instruction tuning and RLHF may reduce the specific hallucinations users reject).
Some mention of that here: https://huyenchip.com/2023/05/02/rlhf.html#rlhf_and_hallucin...
Not specifically showing catastrophic forgetting, but hallucination for o3:
> From the results of this evaluation, o3's hallucination rate is 33 percent, and o4-mini's hallucination rate is 48 percent — almost half of the time. By comparison, o1's hallucination rate is 16 percent, meaning o3 hallucinated about twice as often.
https://mashable.com/article/openai-o3-o4-mini-hallucinate-h...

DeepSeek R1 handles some of this by re-distilling "factual Q&A" generated from the original V3 model back into a new V3. The V3 paper mentions it incorporated an R1 pass too, so the sequence appears to be: V3 base model, RL pass, distilling the RL checkpoint back into V3 and retraining for the final V3 release, then an additional RL pass for the final R1 release.
V3 Paper
> During the post-training stage, we distill the reasoning capability from the DeepSeekR1 series of models [I think that refers to the earlier checkpoint R1 after the first pass below]
R1 Paper:
> To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.
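The rejection-sampling step in that quote can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual code: `generate` and `is_correct` are stand-ins for a model call and an answer verifier, and keeping one accepted sample per prompt is an assumption.

```python
# Hypothetical sketch of rejection-sampling SFT data creation: sample
# several completions from the RL checkpoint and keep only those a
# verifier accepts. Both helpers below are placeholders, not real APIs.

def generate(prompt, n):
    # Placeholder: pretend the model returns n candidate completions.
    return [f"{prompt}-answer-{i}" for i in range(n)]

def is_correct(prompt, completion):
    # Placeholder verifier (e.g. exact-match against a reference answer);
    # here it arbitrarily accepts only the "-0" candidate.
    return completion.endswith("-0")

def rejection_sample(prompts, n=4):
    sft_data = []
    for p in prompts:
        for c in generate(p, n):
            if is_correct(p, c):
                sft_data.append({"prompt": p, "completion": c})
                break  # keep at most one accepted sample per prompt
    return sft_data

data = rejection_sample(["q1", "q2"])
```

The accepted samples then get combined with supervised data from other domains before retraining the base model, per the quote above.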
In general with fine-tuning you can avoid catastrophic forgetting by mixing the original data into later fine-tuning steps, and from this it seems the same is true of the RL phases, though they are also doing some amount of augmentation and selection on the data involved.
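That mixing strategy (often called replay) can be sketched like this. A minimal illustration, assuming a fixed replay fraction per batch; the 25% ratio is made up for the example and not from either paper.

```python
# Sketch of replay mixing to reduce catastrophic forgetting: each
# fine-tuning batch draws a fixed fraction of examples from the original
# data alongside the new task data.
import random

def mixed_batches(new_data, original_data, batch_size=8, replay_frac=0.25, seed=0):
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_frac)  # replayed originals per batch
    n_new = batch_size - n_replay             # new task examples per batch
    batches = []
    for start in range(0, len(new_data), n_new):
        batch = new_data[start:start + n_new]
        batch += rng.sample(original_data, min(n_replay, len(original_data)))
        batches.append(batch)
    return batches

batches = mixed_batches([f"new{i}" for i in range(12)],
                        [f"orig{i}" for i in range(100)])
```

The augmentation-and-selection part (regenerating "factual Q&A" from the original model rather than replaying it verbatim) would replace the raw `original_data` here with freshly generated samples.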