I am very curious whether omitting the KL penalty helps on narrow domains like this, and also whether doing so results in illegible reasoning. (From the samples in the post, it looks like it doesn't make reasoning illegible?)
>the 32B model’s response lengths collapsing, especially after reaching peak performance.
I would not have predicted this. Nor that it could collapse its response length to near zero yet lose only a few percentage points of accuracy. If you used SFT to train a model of the same size to solve these puzzles with no reasoning (just outputting answers directly), how well would it do?
As for response length, I think the model internalizes the logic and no longer needs to deliberate by building up reasoning in context. I don't think this is necessarily good for general reasoning, but for a specific task it would cut down inference costs; it just depends on what you're optimizing for. To encourage more general reasoning, I think broader training and validation sets would be helpful.
Apparently DeepSeek-R1 can switch between English, Chinese, and gibberish, and even the gibberish helps it think! That's fascinating, but all I can find is people saying it, nobody showing it.
https://gr.inc/question/although-a-few-years-ago-the-fundame...
In the dropdown set to DeepSeek-R1, switch to the LIMO model (which apparently has a high frequency of language switching).
I'm not sure about examples of gibberish or totally illegible reasoning. My guess is that since R1-Zero still had the KL penalty, it should all be somewhat legible - the KL penalty encourages the model to not move too far from what the base model would say in any given context.
I think there's still a lot to learn about reward functions (I saw a team reward only the correct output and nothing else): whether you should reward partial success (i.e. code compiles / math outputs a result) or only the final outcome (i.e. test cases pass / correct answer), and so on.
Not to mention how to get downstream signals from e2e tasks (i.e. if an "agent" navigates to the correct webpage and finds a "cookie" or something, figure out how to reward all the intermediate steps from that single binary signal).
And there's a lot to learn in using grammars & stuff w/ RL as well. The problem there is that the libraries are pretty wonky atm (some things work, some things need work), and RL itself is pretty slow, since you have to generate, update the model, and generate again.
From what I can see (not rigorous): Claude 3.7 fails, ChatGPT with reasoning succeeds, DeepSeek with reasoning succeeds.
But of course the best way for a model to solve a problem like this is to translate it into a constraint satisfaction problem, and write out Python code to call a CSP solver.
Which means that when you asked it (e.g.) whether A is better than B (as a Decision Support System), it should write a program to decide it instead of "guessing it" from the network.
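As an illustration of that translate-to-CSP route, here is a minimal brute-force sketch for an invented three-house toy puzzle (the puzzle, people, and clues are all made up for this comment; a real agent would likely emit code for a proper solver like Z3 or python-constraint instead of enumerating):

```python
from itertools import permutations

# Toy "translate the puzzle into a CSP and solve it": enumerate all
# assignments of people to houses and keep those satisfying every clue.
people = ["Alice", "Bob", "Carol"]

def solve():
    solutions = []
    for houses in permutations([1, 2, 3]):  # one house number per person
        alice, bob, carol = houses
        clues = [
            alice != 1,        # clue 1: Alice is not in house 1
            bob == carol + 1,  # clue 2: Bob lives directly right of Carol
        ]
        if all(clues):
            solutions.append(dict(zip(people, houses)))
    return solutions
```

The point being: once the puzzle is in this form, correctness comes from exhaustive search rather than from the network "guessing".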
You are stating that, since the issue is general, LLMs should write programs to produce their own outputs, instead of their standard output.
I'm not sure what that means specifically. I don't agree overall. Only certain types of problems encountered by LLMs map cleanly to well-understood problems where existing solvers are perfect.
>We discovered that meaningful performance improvements, as high as 10–15%, can be achieved with as few as 16 training examples.
In particular, did you need to change the hyperparameters much, and did this limited recipe show different improvements for the larger vs smaller models? Also, how did you select these 16 examples?
We only tested this with the 14B model. You can see the run here:
https://wandb.ai/bradhilton/rl-experiments/runs/062
Performance peaked after 21 iterations at 45% accuracy, short of the final 59%, but that's still a significant increase from very few samples.
Something interesting I noticed in the responses was that for shorter puzzles it would make deductions, building up a set of additional "clues" for itself, before answering the questions. However, for harder puzzles with more clues it would often merely repeat all the given clues and then try to answer the questions directly.
Maybe some form of curriculum learning would help, starting with easier puzzles and progressing to more challenging ones.
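A curriculum could be as simple as sampling from an expanding pool of puzzles sorted by difficulty. This is a hypothetical sketch, not the post's trainer; using clue count as the difficulty proxy is my assumption:

```python
import random

def curriculum_batch(puzzles, progress, batch_size, rng=random):
    """Sample a training batch from the easiest fraction of puzzles.

    puzzles:  list sorted ascending by difficulty (e.g. clue count)
    progress: float in [0, 1], fraction of training completed
    """
    # Always expose at least one batch worth of puzzles.
    pool_size = max(batch_size, int(len(puzzles) * progress))
    return rng.sample(puzzles[:pool_size], k=min(batch_size, pool_size))
```

Early in training (`progress` near 0) batches come only from the easiest puzzles; by the end the whole set is in play.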
Other ideas to explore include:
- Distilling responses from stronger models
- Encouraging exploration with entropy regularization or reward shaping
- Training from base models instead of instruct models, like DeepSeek-R1-Zero
I think there's a lot of benefit to discovering a training regime that allows small specialized models to do extremely well in one narrow task; if we can figure out how to make small models that beat SOTA on a specific task and are cheap to train and run, that's in some ways a more useful outcome than a very large model that is good at many tasks (but is more expensive to run for each of them).
The results of this paper would indicate that doing what they did, but online, could return better results.
With our training recipe this can easily be done by accumulating the gradients across the entire batch and taking only one optimizer step before sampling more responses.
In our experiments, however, we found the advantages of doing multiple gradient steps outweighed any potential drift in policy.
Ultimately the online-ness of data is on a spectrum and while more online data is better, other factors may be more important.
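To spell out the two update schemes being compared, here's a toy contrast on a scalar parameter with a squared-error stand-in for the loss. The functions and the loss are illustrative only, not the recipe's actual trainer:

```python
def grad(theta: float, batch: list) -> float:
    """Gradient of mean squared error of theta against batch targets."""
    return sum(2 * (theta - y) for y in batch) / len(batch)

def one_accumulated_step(theta, minibatches, lr=0.1):
    """More 'online' variant: average gradients over all minibatches,
    then take a single optimizer step before sampling new responses."""
    g = sum(grad(theta, mb) for mb in minibatches) / len(minibatches)
    return theta - lr * g

def multiple_steps(theta, minibatches, lr=0.1):
    """Multiple gradient steps per sampling round: after the first
    step, the policy that generated the data is slightly off-policy."""
    for mb in minibatches:
        theta -= lr * grad(theta, mb)
    return theta
```

The second variant drifts from the sampling policy within a round, which is exactly the trade-off the comment describes.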
Bit pedantic, but amusing thought; wouldn't that imply that asynchronous actor critic is an offline training methodology?
Then group-relative advantages are calculated. If you have 16 different responses and the average accuracy is 0.5, you subtract that from each reward and divide by the standard deviation. Say the standard deviation is 0.25; then the advantage for a response with reward 0.25 would be (0.25 - 0.5) / 0.25 = -1.
The advantages are then used to increase (or decrease) the probability of sampling those tokens again. Since our example was negative, we penalize the model for underperforming with that response.
Would be super interesting to see which one is more data-efficient!