deepGem 1y ago

From the R1 paper:

> In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data.

Is this cold-start data what OpenAI is claiming is their output? If so, what's the big deal?

Imnimo 1y ago

DeepSeek claims that the cold-start data is from DeepSeek-V3, which is the model that has the $5.5M price tag. If that data were actually the output of o1 (a model that had a much higher training cost, and its own RL post-training), that would significantly change the narrative of R1's development, and of what's possible to build from scratch on a comparable training budget.

TheGeminon 1y ago

In the paper DeepSeek just says they have ~800k responses that they used for the cold start data on R1, and are very vague about how they got it:

> To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.

Imnimo 1y ago

My surface-level reading of these two sections is that the 800k samples come from R1-Zero (i.e. "the above RL training") and V3:

>We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training. In the previous stage, we only included data that could be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment.

>For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting.

The non-reasoning portion of the DeepSeek-V3 dataset is described as:

>For non-reasoning data, such as creative writing, role-play, and simple question answering, we utilize DeepSeek-V2.5 to generate responses and enlist human annotators to verify the accuracy and correctness of the data.

I think if we were to take them at their word on all this, it would imply there is no specific OpenAI data in their pipeline (other than perhaps their pretraining corpus containing some incidental ChatGPT outputs that are posted on the web). I guess it's unclear where they got the "reasoning prompts" and corresponding answers, so you could sneak in some OpenAI data there?
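To make the "rejection sampling" plus "rule-based rewards" step in the first quote concrete, here is a minimal sketch of the general idea, not DeepSeek's actual pipeline. Everything in it is an assumption for illustration: the function names, the exact-match rule, and the toy generator standing in for the RL checkpoint are all hypothetical.

```python
import itertools
from typing import Callable, List, Tuple

# A "trajectory" here is (reasoning text, final answer). Hypothetical shape.
Trajectory = Tuple[str, str]


def rule_based_reward(answer: str, ground_truth: str) -> bool:
    """Hypothetical rule: accept only an exact match after trimming whitespace."""
    return answer.strip() == ground_truth.strip()


def rejection_sample(
    prompt: str,
    ground_truth: str,
    generate: Callable[[str], Trajectory],
    num_candidates: int = 4,
) -> List[Trajectory]:
    """Sample several candidates and keep only those that pass the rule check."""
    kept: List[Trajectory] = []
    for _ in range(num_candidates):
        reasoning, answer = generate(prompt)
        if rule_based_reward(answer, ground_truth):
            kept.append((reasoning, answer))
    return kept


def make_toy_generator(answers: List[str]) -> Callable[[str], Trajectory]:
    """Stand-in for a model checkpoint: cycles through fixed answers."""
    answer_cycle = itertools.cycle(answers)

    def generate(prompt: str) -> Trajectory:
        answer = next(answer_cycle)
        return (f"reasoning for {prompt!r} ending in {answer.strip()}", answer)

    return generate


if __name__ == "__main__":
    generate = make_toy_generator(["4", "5", " 4 ", "3"])
    for reasoning, answer in rejection_sample("2 + 2 = ?", "4", generate):
        print(answer.strip(), "|", reasoning)
```

The surviving trajectories are what would go into an SFT dataset. Where the prompts and ground-truth answers come from is outside this sketch, which is exactly the part the paper leaves unclear.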

joe_the_user 1y ago

It's like the claim "they showed anyone can create a powerful model from scratch" becomes "false yet true".

Maybe they needed OpenAI for their process. But now that their model is open source, anyone can use that as their cold start and spend the same amount.

"From scratch" is a moving target. No one who makes their model with massive data from the net is really doing anything from scratch.

bmicraft 1y ago

Yeah, but that kills the implied hope of building a better model for cheaper. This way you'll always have a ceiling of being a bit worse than the OpenAI models.

Loic 1y ago

Not for me. When I build a chemical factory, I do not reinvent everything.

They are using the current SOTA tools and models to build new models for cheaper.

vlovich123 1y ago

If R1 were better than o1, yes, you would be right. But the reporting I’ve seen is that it’s almost as good. Being able to copy cutting-edge models won’t advance the state of the art in terms of intelligence. They have made improvements in other areas, but if they reused o1 to train their model, that would effectively be a ctrl-c / ctrl-v strictly in terms of task performance.

powerapple 1y ago

I lean toward the idea that R1-Zero was trained from a cold start. At the same time, they have tried many things, including using OpenAI APIs. These things can happen in parallel.
