If DeepSeek trained on OpenAI's outputs, then it wasn't trained from scratch for "pennies on the dollar" and isn't the Sputnik-like technical breakthrough that we've been hearing so much about. That's the news here. Or rather, the potential news, since we don't know if it's true yet.
Google DeepMind's recent Gemini 2.0 Flash Thinking is also priced at the new DeepSeek level. It's pretty good (unlike previous Gemini models).
What is definitely true is that other providers are already offering DeepSeek R1 (e.g. on OpenRouter[1]) for $7/M-in and $7/M-out, while OpenAI charges $15/M-in and $60/M-out for o1. So you're already seeing roughly 5x cheaper inference with R1 vs o1 at an even token mix, with a bunch of confounding factors. But it's hard to say anything truly concrete about efficiency, since OpenAI does not disclose the actual compute required to run inference for o1.
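To make the comparison concrete, here's the arithmetic at an assumed 50/50 input/output token mix (list prices from above; the mix is a made-up assumption, and real workloads vary):

```python
# List prices ($ per million tokens) quoted above.
r1_in, r1_out = 7.0, 7.0     # DeepSeek R1 via OpenRouter
o1_in, o1_out = 15.0, 60.0   # OpenAI o1

# Assume a 50/50 mix: 1M input + 1M output tokens.
r1_cost = r1_in + r1_out     # $14
o1_cost = o1_in + o1_out     # $75
print(f"{o1_cost / r1_cost:.1f}x")  # 5.4x
```

At the extremes the ratio runs from about 2.1x (all input tokens) to about 8.6x (all output tokens), which is why the mix matters.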
I didn't know that. Is this always the case?
So you can think of training as CI+TEST_ENV and inference as the cost of running your PROD deployments.
Generally, in traditional IT infra, PROD >> CI+TEST_ENV (on the order of 10-100 to 1).

The ratio might be quite different for LLMs, but any SUCCESSFUL model will still have cumulative inference cost exceed training cost at some point in time.
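As a toy illustration of that crossover point (all numbers here are invented, not real figures for any model):

```python
# Hypothetical figures -- not real numbers for any model.
training_cost = 5_000_000      # one-off "CI+TEST_ENV" spend, $
inference_per_day = 50_000     # ongoing "PROD" serving cost, $/day

# Day on which cumulative inference spend overtakes the training spend.
breakeven_days = training_cost / inference_per_day
print(breakeven_days)  # 100.0
```

Past that point, any efficiency gain on the inference side dominates the economics, regardless of what training cost.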
Using data from another model won't save you any training time.
It's... not, and it's repeatedly been proven in practice that this is an invalid generalization because it's missing necessary qualifications, and it's funny that this myth keeps persisting.
It's probably a bad idea to use uncurated output from another AI to train a model if you are trying to make a better model rather than a distillation of the first model. And it's definitely a bad idea (this is, ISTR, the actual research result from which the false generalization developed) to iteratively fine-tune a model on its own unfiltered output. But there has been lots of success using AI models to generate data which is curated and then used to train other models, which can be much more efficient than trying to create new material without AI once you've already hoovered up all the readily accessible low-hanging fruit of premade content relevant to your training goal.
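The generate-then-curate loop described above can be sketched like this (everything here is illustrative stand-in code, not any real pipeline or API):

```python
# Stand-in for sampling candidate training examples from a teacher model.
def generate_candidates(n):
    return [f"sample-{i}" for i in range(n)]

# Stand-in for the curation step: a verifier, ruleset, or human review
# that keeps only high-quality outputs.
def passes_filter(sample):
    return sample.endswith(("0", "5"))  # arbitrary toy criterion

# Only the curated subset -- not the raw model output -- goes into the
# new model's training set; that's the distinction being drawn above.
curated = [s for s in generate_candidates(20) if passes_filter(s)]
print(len(curated))  # 4 of 20 survive curation
```

The whole argument hinges on that filter: skip it and you're distilling the teacher, warts and all; keep it and the teacher is just a cheap source of raw material.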
This is immediately obvious if you look at it through a statistical learning lens and not the mysticism crystal ball through which many view NNs.
Re: "generally a bad idea", I'd just highlight "generally" ;) Clearly it worked in this case!
I said "generally" because there are things like adversarial training that use a ruleset to help generate correct datasets, and those work well. Outside of techniques like that, it's not just a rule of thumb; it's always true that training on the output of another model will result in a worse model.
https://www.scientificamerican.com/article/ai-generated-data...
Ah. So if I understand this... once the internet becomes completely overrun with AI-generated articles of no particular substance or importance, we should not bulk-scrape that internet again to train the subsequent generation of models.
I look forward to that day.
It proves we _can_ optimize our training data.
Humans have been genetically stable for a long time, yet the quality & structure of the information available to a child today, vs. 2000 years ago, makes them more skilled at certain tasks. Math is a good example.
That is not true at all.
We have known how to solve this for at least 2 years now.
All the latest state of the art models depend heavily on training on synthetic data.
It's not at all obvious to me that that is the case.

I.e., do you need a SOTA model to produce a new SOTA model?

And just because a model trains on some ChatGPT data doesn't mean that data is the majority. It's just another dataset.
If OpenAI trained on the intellectual property of others, maybe it wasn't the creativity breakthrough people claim?
Conversely:

If you say ChatGPT was trained on "whatever data was available", and you say DeepSeek was trained on "whatever data was available", then they sound pretty equivalent.
All the rough consensus language output of humanity is now roughly on the Internet. The various LLMs have roughly distilled it, and the results are naturally going to get tighter and tighter. It's not surprising that companies are going to get better and better at solving the same problem. The significance of DeepSeek isn't so much that it promises future achievements, but that it shows OpenAI's string of announcements to be incremental progress that isn't going to reach the AGI that Altman now often harps on.