If DeepSeek trained on OpenAI's outputs, then it wasn't trained from scratch for "pennies on the dollar" and isn't the Sputnik-like technical breakthrough that we've been hearing so much about. That's the news here. Or rather, the potential news, since we don't know if it's true yet.
Google DeepMind's recent Gemini 2.0 Flash Thinking is also priced at the new DeepSeek level. It's pretty good (unlike previous Gemini models).
What is definitely true is that other providers are already offering DeepSeek R1 (e.g. on OpenRouter[1]) for $7/M-in and $7/M-out, while OpenAI charges $15/M-in and $60/M-out for o1. So you're already seeing roughly 5x cheaper inference with R1 vs o1 at an even token mix, with a bunch of confounding factors. But it's hard to say anything truly concrete about efficiency, since OpenAI does not disclose the actual compute required to run inference for o1.
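To make the comparison concrete, here's the arithmetic at an assumed 50/50 input/output token mix (list prices from above; the mix is a made-up assumption, and real workloads vary):

```python
# List prices ($ per million tokens) quoted above.
r1_in, r1_out = 7.0, 7.0     # DeepSeek R1 via OpenRouter
o1_in, o1_out = 15.0, 60.0   # OpenAI o1

# Assume a 50/50 mix: 1M input + 1M output tokens.
r1_cost = r1_in + r1_out     # $14
o1_cost = o1_in + o1_out     # $75
print(f"{o1_cost / r1_cost:.1f}x")  # 5.4x
```

At the extremes the ratio runs from about 2.1x (all input tokens) to about 8.6x (all output tokens), which is why the mix matters.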
I didn't know that. Is this always the case?
So you can think of training as CI+TEST_ENV and inference as the cost of running your PROD deployments.
Generally, in traditional IT infra, PROD >> CI+TEST_ENV (on the order of 10-100 to 1).

The ratio might be quite different for LLMs, but any SUCCESSFUL model will still have cumulative inference cost exceed training cost at some point in time.
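As a toy illustration of that crossover point (all numbers here are invented, not real figures for any model):

```python
# Hypothetical figures -- not real numbers for any model.
training_cost = 5_000_000      # one-off "CI+TEST_ENV" spend, $
inference_per_day = 50_000     # ongoing "PROD" serving cost, $/day

# Day on which cumulative inference spend overtakes the training spend.
breakeven_days = training_cost / inference_per_day
print(breakeven_days)  # 100.0
```

Past that point, any efficiency gain on the inference side dominates the economics, regardless of what training cost.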
Using data from another model won't save you any training time.
It's... not, and it's repeatedly been proven in practice that this is an invalid generalization because it's missing necessary qualifications, and it's funny that this myth keeps persisting.
It's probably a bad idea to use uncurated output from another AI to train a model if you are trying to make a better model rather than a distillation of the first model. And it's definitely a bad idea (this is, ISTR, the actual research result from which the false generalization developed) to iteratively fine-tune a model on its own unfiltered output. But there has been lots of success using AI models to generate data which is curated and then used to train other models, which can be much more efficient than trying to create new material without AI once you've already hoovered up all the readily accessible low-hanging fruit of premade content relevant to your training goal.
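The generate-then-curate loop described above can be sketched like this (everything here is illustrative stand-in code, not any real pipeline or API):

```python
# Stand-in for sampling candidate training examples from a teacher model.
def generate_candidates(n):
    return [f"sample-{i}" for i in range(n)]

# Stand-in for the curation step: a verifier, ruleset, or human review
# that keeps only high-quality outputs.
def passes_filter(sample):
    return sample.endswith(("0", "5"))  # arbitrary toy criterion

# Only the curated subset -- not the raw model output -- goes into the
# new model's training set; that's the distinction being drawn above.
curated = [s for s in generate_candidates(20) if passes_filter(s)]
print(len(curated))  # 4 of 20 survive curation
```

The whole argument hinges on that filter: skip it and you're distilling the teacher, warts and all; keep it and the teacher is just a cheap source of raw material.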
This is immediately obvious if you look at it through a statistical learning lens and not the mysticism crystal ball through which many view NNs.
Re: "generally a bad idea", I'd just highlight "generally" ;) Clearly it worked in this case!
I said "generally" because there are things like adversarial training that use a ruleset to help generate correct datasets, and those work well. Outside of techniques like that, it's not just a rule of thumb; it's always true that training on the output of another model will result in a worse model.
https://www.scientificamerican.com/article/ai-generated-data...
Ah. So if I understand this... once the internet becomes completely overrun with AI-generated articles of no particular substance or importance, we should not bulk-scrape that internet again to train the subsequent generation of models.
I look forward to that day.
It proves we _can_ optimize our training data.
Humans have been genetically stable for a long time, yet the quality & structure of the information available to a child today, vs. 2000 years ago, makes them more skilled at certain tasks. Math is a good example.
That is not true at all.
We have known how to solve this for at least 2 years now.
All the latest state of the art models depend heavily on training on synthetic data.
It's not at all obvious to me that that is the case.

I.e., do you need a SOTA model to produce a new SOTA model?

And just because a model trains on some ChatGPT data doesn't mean that data is the majority. It's just another dataset.
If OpenAI trained on the intellectual property of others, maybe it wasn't the creativity breakthrough people claim?
Conversely:

If you say ChatGPT was trained on "whatever data was available", and you say DeepSeek was trained on "whatever data was available", then they sound pretty equivalent.
All the rough consensus language output of humanity is now roughly on the Internet. The various LLMs have roughly distilled it, and the results are naturally going to get tighter and tighter. It's not surprising that companies are going to get better and better at solving the same problem. The significance of DeepSeek isn't so much that it promises future achievements, but that it shows OpenAI's string of announcements to be incremental progress that isn't going to reach the AGI that Altman now often harps on.