"DeepSeek trained on our outputs and that's not fair because those outputs are ours, and you shouldn't take other people's data!" This is obviously extremely silly, because that's exactly how OpenAI got all of its training data in the first place - by scraping other people's data off the internet.
"DeepSeek trained on our outputs, and so their claims of replicating o1-level performance from scratch are not really true" This is at least plausibly a valid claim. The DeepSeek R1 paper shows that distillation is really powerful (e.g. they show Llama models get a huge boost by finetuning on R1 outputs), and if it were the case that DeepSeek were using a bunch of o1 outputs to train their model, that would legitimately cast doubt on the narrative of training efficiency. But that's a separate question from whether it's somehow unethical to use OpenAI's data the same way OpenAI uses everyone else's data.
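To make the distillation claim concrete, here is a toy sketch of what it means mechanically. Everything here is hypothetical and drastically simplified: the "teacher" is just a fixed next-token distribution standing in for an expensive model like o1, and the "student" is fit only to samples of the teacher's output, never to the teacher's original training data. A real setup would fine-tune a student network on teacher completions, but the principle is the same.

```python
import collections
import random

# Hypothetical "teacher": a fixed next-token distribution, standing in for
# an expensive model whose outputs we can sample but whose training data
# we never see.
teacher = {"cat": 0.7, "dog": 0.3}

def sample(dist, rng):
    """Draw one token from a {token: probability} dict."""
    r, acc = rng.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # fallback for floating-point rounding

# "Distillation": the student is fit only to teacher samples, yet it
# approximately recovers the teacher's behavior.
rng = random.Random(0)
counts = collections.Counter(sample(teacher, rng) for _ in range(10_000))
student = {tok: n / 10_000 for tok, n in counts.items()}
```

The point of the sketch: the student ends up close to the teacher without ever touching the data the teacher was trained on, which is why distillation is so much cheaper than training from scratch.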
(with the caveat that all we have right now are accusations that DeepSeek made use of OpenAI data - it might just as well turn out that DeepSeek really did work independently, and you really could have gotten o1-like performance with much less compute)
In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data
Is this cold-start data what OpenAI is claiming is their output? If so, what's the big deal?
It is no better for OpenAI in this scenario either: any competitor can easily copy their expensive training without spending the same, i.e. there is a second-mover advantage and no economic incentive to be the first one.
To put it another way, the $500 billion Stargate investment will be worth just $5 billion once the models become available for consumption, because that is all it will take to replicate the same outcomes with new techniques, even if the cold start needed o1 output for RL.
Let's just assume that the cost of training can be externalized to other people for free.
The big question really is: are we doing it wrong? Could we have created o1 for a fraction of the price? Will o4 cost less to train than o1 did?
The second question, naturally, is: if we create a smarter LLM, can we use it to create another LLM that is even smarter?
It would have been fantastic if DeepSeek could have come out with an o3 competitor before o3 even became publicly available. That way we would have known for sure that we're doing it wrong, because then either we could have used o1 to train a better AI, or we could have just trained in a smarter and cheaper way.
Whether or not you could have, you can now.
All of this should have been clear anyway from the start, but that's the Internet for you.
Hmm, I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model, the human isn't necessary.
As far as I know, DeepSeek adds only a little to the transformers model while o1/o3 added a special "reasoning component" - if DeepSeek is as good as o1/o3, even after taking data from them, then it seems the reasoning component isn't needed.
I did not think this, nor did I think this was what others assumed. The narrative, I thought, was that there is little point in paying OpenAI for LLM usage when a much cheaper, similar / better version can be made and used for a fraction of the cost (whether it's on the back of existing LLM research doesn't factor in)
But HOW they are necessary is the change. They went from building blocks to stepping stones. From a business standpoint that's very damaging to OAI and other players.
And is this related to the lottery ticket hypothesis?
I have a question (disclaimer: reinforcement learning noob here):
Is there a risk of broken telephone with this?
Kinda like repeatedly compressing an already compressed image eventually leads to a fuzzy blur.
If that is the case then I’m curious how this is monitored and / or mitigated.
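The "broken telephone" worry has a name in the literature: model collapse. Here is a minimal, purely illustrative sketch (toy numbers, all names hypothetical) of the mechanism: each "generation" is fit only to samples drawn from the previous generation's estimate, and a rare mode that ever misses a sample can never come back.

```python
import collections
import random

rng = random.Random(42)

# Generation 0: the "real" distribution, with a rare mode.
dist = {"common": 0.9, "rare": 0.1}
N = 50  # small per-generation sample size amplifies the drift

for _ in range(100):
    # Each generation is "trained" only on samples from the previous one.
    draws = rng.choices(list(dist), weights=list(dist.values()), k=N)
    counts = collections.Counter(draws)
    dist = {tok: n / N for tok, n in counts.items()}
    if "rare" not in dist:
        # Once the rare mode misses a sample, the next generation assigns
        # it probability zero and it is gone for good.
        break
```

The usual mitigations discussed are exactly what the sketch suggests: keep mixing in fresh human-generated data so that lost modes can be reintroduced, and monitor distribution drift between generations rather than just headline benchmark scores.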
That is where artificial intelligence is going: copying things from other things. Will there be an AI eureka moment where it deviates and knows where, and why, it is wrong?
It seems like if they did in fact distill, then what we have found is that you can create a worse copy of the model for ~$5M in compute by training on its outputs.
Better benchmark scores can be cooked
But if you leave someone in the tech industry of SV/SF long enough, they'll start to get high on their own supply and think they're entitled to insane amounts of value, so...
Look at the whole AI revolution that Meta and others have bootstrapped by opening their models. Meanwhile OpenAI/Microsoft, Anthropic, Google and the rest are just trying to look after number one while trying to regulatory-capture an "AI for me but not for thee" outcome of full control.
A case of thieves yelling 'stop those thieves', to me: they were just first and would not like losing that position. But it's all about money, and consequently power. Business as usual.
The comments were moved here by dang from a flagged article with an editorialized/clickbait title. That flagged post has 1300 points at the time of writing.
https://news.ycombinator.com/item?id=42865527
1. It should be incumbent on the moderator to at least consider that the motivation for the points and comments may have been because many thought the "hypocrisy" of OpenAI's position was a more important issue than OpenAI's actual claim of DeepSeek violating its ToS. Moving the comments to an article that buries the potential hypocrisy issue that may have driven the original points and comments is not ideal.
2. This article is from FT, which has a content license deal with OpenAI. Moving the comments to an article from a company that has a conflict of interest due to its commercial relations with the YC company in question is problematic here, especially since dang often states they try to be more hands-off on moderation when the article is about a YC company.
3. There is a link by dang to this thread from the original thread, but there should also be a link by dang to the original thread from here as well. Why is this not the case?
4. Ideally, dang should have asked for a more substantial submission that prioritized the hypocrisy point to better match the spirit of the original post instead of moving the comments to this article.
IANAL, but it is worth noting here that DeepSeek has explicitly consented to a license that doesn't allow them to do this. That is a condition of using ChatGPT and the OpenAI API.
Even if the courts affirm that there's a fair use defence for AI training, DeepSeek may still be in the wrong here, not because of copyright infringement, but because of a breach of contract.
I don't think OpenAI would have much of a problem if you train your model on data scraped from the internet, some of which incidentally ends up being generated by ChatGPT.
Compare this to training AI models on Kindle Books randomly scraped off the internet, versus making a Kindle account, agreeing to the Kindle ToS, buying some books, breaking Amazon's DRM and then training your AI on that. What DeepSeek did is more analogous to the latter than the former.
You actually don’t know this. Even if it were true that they used OpenAI outputs (and I’m very doubtful) it’s not necessary to sign an agreement with OpenAI to get API outputs. You simply acquire them from an intermediary, so that you have no contractual relationship with OpenAI to begin with.
I have some news for you
By existing in the USA, OpenAI consented to comply with copyright law, and how did that go?
OpenAI can't have it both ways
Like I’ve said time and time again, nobody in this space gives a fuck about anyone that isn’t directly contributing money to their bottom line at that particular instant. The fundamental idea is selfish, damages the fundamental machinery that makes the internet useful by penalizing people that actually make things, and will never, ever do anything for the greater good if it even stands a chance of reducing their standing in this ridiculously overhyped market. Giving people free access to what is for all intents and purposes a black box is not “open” anything, is no more free (as in speech) than Slack is, and all of this is obviously them selling a product at a huge loss to put competing media out of business and grab market share.
But in all reality I'm happy to see this day. The fact that OpenAI ripped off everyone and everything they could and, to this day, pretend like they didn't, is fantastic.
Sam Altman is a con, and it's not surprising that, given all the positive press DeepSeek got, there was a full-court assault on them within 48 hours.
But IANAL, so if you have a citation that says otherwise I'd be happy to see it!
I hope voters and governments put a long-overdue stop to this cancer of contract-maximalism that has given us such benefits as mandatory arbitration, anti-benchmarking, general circumvention of consumer rights, or, in this case, blatantly anti-competitive terms, by effectively banning reverse-engineering (i.e. examining how something works, i.e. mandating that we live in ignorance).
Because if they don't, laws will slowly become irrelevant, and our lives governed by one-sided contracts.
So no, it doesn't belong to OpenAI.
You might be able to sue for breach-of-contract penalties under the ToS, but that doesn't give them any right to the model. Nor does it give them any right to invalidate the unbound copyright grants they have given to third parties (here, literally everyone), prevent anyone from training their own new models based on it, or prevent anyone from using it. Oh, and the one breaching the ToS might not even have been the company behind DeepSeek but some in-between third party.
Naturally this is under a few assumptions:
- the US consistently applies its own law, but they have a long history of not doing so
- the US doesn't abuse its power to force its economic preferences (ban DeepSeek) on other countries
- it actually was trained on OpenAI outputs, but, uh, OpenAI has IMHO shown very clearly over the years that they can't be trusted and they are fully opaque. How do we trust their claim? How do we trust them not to have retrospectively tweaked their model to make it look as if DeepSeek copied it?
The US ruled that an AI cannot be the author; that does not mean, as so many clickbait articles suggest, that no AI products can be copyrighted.
One activist tried to get the US Copyright Office to acknowledge his LLM as the author, which would then provide him a license to the work.
There was no issue with him being the original author and copyright holder of the AI works. But that's not what was being challenged.
I'm wondering how Deepseek could have made 100s of millions of training queries to OpenAI and not one person at OpenAI caught on.
Now, DeepSeek may (or may not) have used some o1-generated data for the R1-Zero RL training, but if so that's just a cost saving versus having to source some reasoning data another way, and it in no way reduces the legitimacy of what they accomplished (which is not something any of the AI CEOs are saying).
OpenAI has also invested heavily in human annotation and RLHF. If all DeepSeek wanted was a proxy for scraped training data, they'd probably just scrape it themselves. Using existing RLHF'd models as replacement for expensive humans in the training loop is the real game changer for anyone trying to replicate these results.
That's like the mafia complaining that they worked so hard to steal those barrels of beer that someone made off with in the middle of the night and really that's not fair and won't someone do something about it?
Besides deals with insurance companies and governments, one of the ways they are still able to pull this off is by convincing everyone that it's too dangerous to play with this at home or buy it from an Asian supplier.
At least with software we had until now a way to build and run most things without requiring dedicated super expensive equipment. OpenAI pulled a big Pharma move but hopefully there will be enough disruptors to not let them continue it.
And if DeepSeek had a mole, why would they bother running a massive job internally to steal the data generated? It would be way easier for the mole to just leak the RL training process, and DeepSeek could quietly copy it rather than bothering with exfiltrating massive datasets to distill. The training process is most likely like, on the order of a hundred lines of Python or so, and you don't even need the file: you just need someone to describe it to you. Much simpler than snatching hundreds of gigabytes of training data off of internal servers...
Plus, the RL process described in DeepSeek's paper has already been replicated by a PhD student at Berkeley: https://x.com/karpathy/status/1884678601704169965 So, it seems pretty unlikely they simply distilled R1 and lied about it, or else how does their RL training algo actually... work?
This is mainly cope from OpenAI that their supposedly super duper advanced models got caught by China within a few months of release, for way cheaper than it cost OpenAI to train.
Someone has to correct me if I'm wrong, but I believe in ML research you always have a dataset and a model. They are distinct entities. It is plausible that output from OpenAI's model improved the quality of DeepSeek's dataset. Just like everyone publishing their code on GitHub improved the quality of OpenAI's dataset. What has been the thinking so far is that the dataset is not "part of" or "in" the model any more than the GPUs used to train the model are. It seems strange that that thinking should now change just because Chinese researchers did it better.
OpenAI has a message they need to tell investors right now: "DeepSeek only works because of our technology. Continue investing in us."
The choice of how they're wording that of course also tells you a lot about who they think they're talking to: namely, "the Chinese are unfairly abusing American companies" is a message that is very popular with the current billionaires and American administration.
The above OpenAI quote from the article leans heavily towards #1 and IMO not at all towards #2. The latter would be an extremely charitable reading of their statement.
It's going to shift the market for how foundation models are used. Companies creating models will be incentivized to vertically integrate, owning the full stack of model usage. Exposing powerful models via APIs just lets a competitor clone your work. In a way, OpenAI's Operator is a hint of what's to come.
Some may view this as partially true, given that o1 does not output its CoT process.
Whatever that means. The legal system right now is in shambles and flat-footed.
Knowing our current government leadership, I think we’re going to see some brute force action backed up by the United States military.
Even if they didn't directly, intentionally use o1 output (and they didn't claim they didn't, so far as I know), AI slop is everywhere. We passed peak original content years ago. Everything is tainted, and everything should be understood in that context.
In relative terms, that's obviously and most definitely true.
In absolute terms, that's obviously and most definitely false.
That's honestly such an academic point; who really cares?
They've been outcompeted, and the argument is 'well, if we didn't let people access our models, they would have taken longer to get here'. So what??
The only thing this gets them is an explanation of why training o1 cost them more than $5 million or whatever, but that is in the past: the datacentre has consumed the energy, and the money has gone up in fairly literal steam.
That being said, breaching OAI's systems, re-training a better model on top of their closed-source model, then open-sourcing it: that's more Robin Hood than villain, I'd say.
The Chinese Communist party very much sees itself in a global rivalry over "new productive forces". That's official policy. And US leadership basically agrees.
The US is playing dirty by essentially embargoing China over big AI - why wouldn't it occur to them to retaliate by playing dirtier?
I mean we probably won't know for sure, but it's much less far fetched than a lot of other speculation in this area.
E.g., R1's cold start training could probably have benefited quite a bit from having access to OpenAI's chain of thought data for training. The paper is a bit light on detail on how it was made.
Meanwhile, they have access to Meta models and Qwen. And Meta models are very easy to run and there's plenty of published work on them. Occam's Razor.
IMHO the whole world is becoming crazy for a lot of reasons, and pissing off billionaires makes me laugh.
Cheapening a series of fact-checkable innovations because of their country of origin, when so far all they have shown are signs of good faith, is paranoid at best and, at worst, propaganda to support the billionaire tech lords saving face for their own arrogance.
The word "our" does a lot of heavy lifting in politics[0]. America is not a commune, it's a country club, one which we used to own but have been bought out of, and whose new owners view us as moochers but can't actually kick us out (yet). It is in competition with another, worse country club that purports to be a commune. We owe neither country club our loyalty, so when one bloodies the other's nose, I smile.
[0] Some languages have a notion of an "exclusive we". If English had such a concept, this would be an exclusive our.