1:05:40
> the improvement is attributed to boosting the correct response from Top-K rather than the enhancement of fundamental capabilities. This is something that we've come to learn in a lot of different ways, from reinforcement learning on language models or even supervised fine-tuning: what's happening, most likely, is that the capabilities of doing all of these things are already present in the underlying pre-trained language model.
https://www.youtube.com/watch?v=bAWV_yrqx4w&t=1h4m

From the paper:
> 5.2.2. Why RL Works?
>
> In this paper, we conduct reinforcement learning based on a subset of instruction tuning data, and it achieves significant performance enhancement upon the instruction tuning model. To further explain why reinforcement learning works. We evaluate the Pass@K and Maj@K accuracy of the Instruct and RL models on two benchmarks. As shown in Figure 7, RL enhances Maj@K's performance but not Pass@K. These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities. Similarly, (Wang et al., 2023a) identified a misalignment problem in reasoning tasks within the SFT model, showing that the reasoning performance of SFT models can be improved through a series of preference alignment strategies (Song et al., 2023; Wang et al., 2023a; Yuan et al., 2023b).
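The Pass@K vs. Maj@K distinction is the crux of the paper's argument: if the base model can already produce the correct answer somewhere in K samples (Pass@K unchanged), but RL makes that answer win the majority vote more often (Maj@K improved), then RL is sharpening the output distribution rather than adding capability. A minimal sketch of the two metrics, using made-up sample answers and the simple "any sample correct" form of Pass@K (the paper's evaluation likely uses an unbiased estimator over more samples; exact-string matching is also a simplification of answer checking):

```python
from collections import Counter

def pass_at_k(samples: list[str], truth: str) -> bool:
    # Pass@K: the problem counts as solved if ANY of the K samples is correct.
    return any(s == truth for s in samples)

def maj_at_k(samples: list[str], truth: str) -> bool:
    # Maj@K: the problem counts as solved only if the MAJORITY-VOTE answer is correct.
    vote, _ = Counter(samples).most_common(1)[0]
    return vote == truth

# Hypothetical K=5 final answers for two problems (not real benchmark data).
# Problem A: the correct answer appears among the samples but loses the vote.
# Problem B: the correct answer also wins the vote.
problems = [
    (["12", "15", "15", "12", "9"], "9"),  # Pass@5 hit, Maj@5 miss
    (["7", "7", "3", "7", "2"], "7"),      # Pass@5 hit, Maj@5 hit
]

pass_rate = sum(pass_at_k(s, t) for s, t in problems) / len(problems)
maj_rate = sum(maj_at_k(s, t) for s, t in problems) / len(problems)
print(pass_rate, maj_rate)  # 1.0 0.5
```

In this toy setup, an RL model that merely shifted Problem A's votes toward "9" would raise Maj@5 to 1.0 while leaving Pass@5 at 1.0 — exactly the pattern the paper reports in Figure 7.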
In the video, he reads this as evidence that these methods alone may not get us over the data wall: they remain fundamentally limited by the output distribution of the base model they augment.