1:05:40
> the improvement is attributed to boosting the correct response from Top-K rather than the enhancement of fundamental capabilities. This is something that we've come to learn in a lot of different ways, from reinforcement learning on language models or even supervised fine-tuning: what's happening, most likely, is that the capabilities of doing all of these things are already present in the underlying pre-trained language model.
https://www.youtube.com/watch?v=bAWV_yrqx4w&t=1h4m

From the paper:
> 5.2.2. Why RL Works?
>
> In this paper, we conduct reinforcement learning based on a subset of instruction tuning data, and it achieves significant performance enhancement upon the instruction tuning model. To further explain why reinforcement learning works. We evaluate the Pass@K and Maj@K accuracy of the Instruct and RL models on two benchmarks. As shown in Figure 7, RL enhances Maj@K's performance but not Pass@K. These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities. Similarly, (Wang et al., 2023a) identified a misalignment problem in reasoning tasks within the SFT model, showing that the reasoning performance of SFT models can be improved through a series of preference alignment strategies (Song et al., 2023; Wang et al., 2023a; Yuan et al., 2023b).
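The Pass@K vs. Maj@K distinction is the crux of the paper's argument: if the base model can already produce the correct answer somewhere in K samples (Pass@K unchanged), but RL makes that answer win the majority vote more often (Maj@K improved), then RL is sharpening the output distribution rather than adding capability. A minimal sketch of the two metrics, using made-up sample answers and the simple "any sample correct" form of Pass@K (the paper's evaluation likely uses an unbiased estimator over more samples; exact-string matching is also a simplification of answer checking):

```python
from collections import Counter

def pass_at_k(samples: list[str], truth: str) -> bool:
    # Pass@K: the problem counts as solved if ANY of the K samples is correct.
    return any(s == truth for s in samples)

def maj_at_k(samples: list[str], truth: str) -> bool:
    # Maj@K: the problem counts as solved only if the MAJORITY-VOTE answer is correct.
    vote, _ = Counter(samples).most_common(1)[0]
    return vote == truth

# Hypothetical K=5 final answers for two problems (not real benchmark data).
# Problem A: the correct answer appears among the samples but loses the vote.
# Problem B: the correct answer also wins the vote.
problems = [
    (["12", "15", "15", "12", "9"], "9"),  # Pass@5 hit, Maj@5 miss
    (["7", "7", "3", "7", "2"], "7"),      # Pass@5 hit, Maj@5 hit
]

pass_rate = sum(pass_at_k(s, t) for s, t in problems) / len(problems)
maj_rate = sum(maj_at_k(s, t) for s, t in problems) / len(problems)
print(pass_rate, maj_rate)  # 1.0 0.5
```

In this toy setup, an RL model that merely shifted Problem A's votes toward "9" would raise Maj@5 to 1.0 while leaving Pass@5 at 1.0 — exactly the pattern the paper reports in Figure 7.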
In the video, he reads this as evidence that these methods alone may not get us over the data wall: they remain fundamentally limited by the output distribution of the base model they augment.