"RL boosts sampling efficiency but reduces the reasoning capacity boundary."
Perhaps better to put it like this: given one or a few attempts, RL-trained models beat non-RL models. Given many attempts, non-RL models come up with better answers.
From the video (https://www.youtube.com/watch?v=bAWV_yrqx4w&t=1h4m, around 1:05:40):

> the improvement is attributed to boosting the correct response from Top K rather than the enhancement of fundamental capabilities. This is something that we've come to learn in a lot of different ways, from like reinforcement learning on language models or even supervised fine-tuning: what's happening most likely is more that the capabilities of doing all of these things are already present in the underlying pre-trained language model.

from the paper:
> 5.2.2. Why RL Works?
>
> In this paper, we conduct reinforcement learning based on a subset of instruction tuning data, and it achieves significant performance enhancement upon the instruction tuning model.
>
> To further explain why reinforcement learning works. We evaluate the Pass@K and Maj@K accuracy of the Instruct and RL models on two benchmarks. As shown in Figure 7, RL enhances Maj@K’s performance but not Pass@K. These findings indicate that RL enhances the model’s overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities. Similarly, (Wang et al., 2023a) identified a misalignment problem in reasoning tasks within the SFT model, showing that the reasoning performance of SFT models can be improved through a series of preference alignment strategies (Song et al., 2023; Wang et al., 2023a; Yuan et al., 2023b).
In the video he reads this as meaning that these methods alone may not get us over the data wall at all, and that they are still fundamentally limited by the distribution of the base model they augment.
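To make the Pass@K / Maj@K distinction concrete, here's a minimal sketch of one common way these metrics are computed from sampled completions (the unbiased Pass@K estimator is from Chen et al., 2021; the function and variable names here are mine, not from the paper or the video):

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn from n total samples of which c are
    correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def maj_at_k(answers: list[str], ground_truth: str, k: int) -> float:
    """Maj@K: majority-vote the first k sampled answers and score 1.0 if
    the most common answer matches the ground truth, else 0.0."""
    majority_answer, _ = Counter(answers[:k]).most_common(1)[0]
    return float(majority_answer == ground_truth)

# Toy example: 16 samples per problem, 6 of them correct.
print(pass_at_k(n=16, c=6, k=1))  # 0.375: chance a single sample is right
print(pass_at_k(n=16, c=6, k=8))  # ~0.997: many attempts almost surely find it
```

The thread's point maps onto this directly: RL shifts probability mass towards answers the base model could already sample, which lifts Maj@K (and Pass@1) but leaves the Pass@K ceiling roughly where it was.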
If you mess around with trained weights you're going to delete some base knowledge, at least the knowledge that is outside of the tasks you RL on.
It would be interesting to know how much of the sampling efficiency improvement from reinforcement learning is due to being better at basic arithmetic (something which could also be achieved by giving the model access to a calculator tool) and how much is due to choosing the correct approach for solving the problem more often.
This is a weak argument. I think I get what we're trying to say, but let's take it to the extreme, say pass@10^10^100. Just like a group of monkeys could write Shakespeare given enough time, a completely random model could probably outperform an RL-trained model at pass@10^10^100. Would we then say the random model can reason too?
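Rough numbers make the point, under purely illustrative assumptions (vocabulary size and answer length picked arbitrarily): a model sampling 200-token answers uniformly at random from a 50,000-token vocabulary needs on the order of 10^940 tries in expectation to hit any one specific answer, which is astronomically large but still nothing next to 10^10^100, so its pass@10^10^100 on a verifiable problem would indeed be ~1.

```python
from math import log10

vocab_size, answer_len = 50_000, 200  # arbitrary illustrative assumptions
log10_expected_tries = answer_len * log10(vocab_size)
print(round(log10_expected_tries))  # ~940, i.e. roughly 10^940 uniform samples needed
```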
Of course the correct reasoning trace will be in the base model's distribution, just like any other well-formed, coherent paragraph. Kind of makes me think, maybe sampling efficiency _is_ intelligence?
However, if you're training on many problems, it's possible in principle that if you have traction on _any_ of the problems, then the learning signal you get from success on those problems will have a positive effect on the model's behavior on other problems. I.e., the learning you do on problems where the model is already producing positive-reward behavior will nudge the model towards producing positive-reward behavior on problems where it wasn't previously doing so.
This is a really good observation. It means that you don't need to RL the full model. You merely need to RL a few LoRAs or maybe a small Mamba model appended to the final layer.
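A minimal sketch of that idea, assuming Hugging Face's transformers and peft libraries (the model name and LoRA hyperparameters below are placeholders, not anything from the thread): wrap the frozen base model with LoRA adapters so an RL step only updates a tiny fraction of the weights, which also bounds how much base knowledge it can overwrite.

```python
# Sketch: RL-tune only LoRA adapters while the base weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("some-base-model")  # placeholder name

lora_cfg = LoraConfig(
    r=16,                                  # low-rank dimension of the adapters
    lora_alpha=32,                         # scaling factor for adapter updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# An RL loop (PPO, GRPO, ...) would now only update the adapter weights;
# disabling or removing the adapters recovers the original base model exactly.
```

Whether adapters alone capture the full RL gains is an open question, but it's a cheap experiment precisely because so few parameters move.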
Also, fun fact many don't know: if you run a regular model's chat template with a reasoning-tuned model, it can go back to acting like the base model, with no "thinking" process.
"Reasoning" models are not any better than non reasoning models. It's a parlor trick, and benchmarks which claimed it wasn't are bad.
Well, of course. They've been "fine-tuned" with specific chat templates. Remove those and the fine-tune doesn't take precedence anymore. That's expected behaviour I'd say.
> "Reasoning" models are not any better than non reasoning models. It's a parlor trick, and benchmarks which claimed it wasn't are bad.
All of them? Including the closed ones, never public? I highly doubt that.