"DeepSeek trained on our outputs and that's not fair because those outputs are ours, and you shouldn't take other people's data!" This is obviously extremely silly, because that's exactly how OpenAI got all of its training data in the first place - by scraping other people's data off the internet.
"DeepSeek trained on our outputs, and so their claims of replicating o1-level performance from scratch are not really true" This is at least plausibly a valid claim. The DeepSeek R1 paper shows that distillation is really powerful (e.g. they show Llama models get a huge boost by finetuning on R1 outputs), and if it were the case that DeepSeek were using a bunch of o1 outputs to train their model, that would legitimately cast doubt on the narrative of training efficiency. But that's a separate question from whether it's somehow unethical to use OpenAI's data the same way OpenAI uses everyone else's data.
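To make the distillation claim concrete, here is a toy sketch of what it means mechanically. Everything here is hypothetical and drastically simplified: the "teacher" is just a fixed next-token distribution standing in for an expensive model like o1, and the "student" is fit only to samples of the teacher's output, never to the teacher's original training data. A real setup would fine-tune a student network on teacher completions, but the principle is the same.

```python
import collections
import random

# Hypothetical "teacher": a fixed next-token distribution, standing in for
# an expensive model whose outputs we can sample but whose training data
# we never see.
teacher = {"cat": 0.7, "dog": 0.3}

def sample(dist, rng):
    """Draw one token from a {token: probability} dict."""
    r, acc = rng.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # fallback for floating-point rounding

# "Distillation": the student is fit only to teacher samples, yet it
# approximately recovers the teacher's behavior.
rng = random.Random(0)
counts = collections.Counter(sample(teacher, rng) for _ in range(10_000))
student = {tok: n / 10_000 for tok, n in counts.items()}
```

The point of the sketch: the student ends up close to the teacher without ever touching the data the teacher was trained on, which is why distillation is so much cheaper than training from scratch.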
(with the caveat that all we have right now are accusations that DeepSeek made use of OpenAI data - it might just as well turn out that DeepSeek really did work independently, and you really could have gotten o1-like performance with much less compute)
In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data
Is this cold-start data what OpenAI is claiming is their output? If so, what's the big deal?
It is no better for OpenAI in this scenario either: any competitor can easily copy their expensive training without spending the same, i.e. there is a second-mover advantage and no economic incentive to be the first one.
To put it another way, the $500 billion Stargate investment will be worth just $5 billion once the models become available for consumption, because that is all it will take to replicate the same outcomes with new techniques, even if the cold start needed o1 output for RL.
Let's just assume that the cost of training can be externalized to other people for free.
The big question really is: are we doing it wrong? Could we have created o1 for a fraction of the price? Will o4 cost less to train than o1 did?
The second question, naturally, is: if we create a smarter LLM, can we use it to create another LLM that is even smarter?
It would have been fantastic if DeepSeek could have come out with an o3 competitor before o3 even became publicly available. That way we would have known for sure that we're doing it wrong, because then either we could have used o1 to train a better AI, or we could have just trained in a smarter and cheaper way.
Whether or not you could have, you can now.
All of this should have been clear anyway from the start, but that's the Internet for you.
Hmm, I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model, the human isn't necessary.
As far as I know, DeepSeek adds only a little to the transformers model while o1/o3 added a special "reasoning component" - if DeepSeek is as good as o1/o3, even after taking data from them, then it seems the reasoning component isn't needed.
I did not think this, nor did I think this was what others assumed. The narrative, I thought, was that there is little point in paying OpenAI for LLM usage when a much cheaper, similar / better version can be made and used for a fraction of the cost (whether it's on the back of existing LLM research doesn't factor in)
But HOW they are necessary is the change. They went from building blocks to stepping stones. From a business standpoint that's very damaging to OAI and other players.
And is this related to the lottery ticket hypothesis?
I have a question (disclaimer: reinforcement learning noob here):
Is there a risk of broken telephone with this?
Kinda like repeatedly compressing an already compressed image eventually leads to a fuzzy blur.
If that is the case then I’m curious how this is monitored and / or mitigated.
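The "broken telephone" worry has a name in the literature: model collapse. Here is a minimal, purely illustrative sketch (toy numbers, all names hypothetical) of the mechanism: each "generation" is fit only to samples drawn from the previous generation's estimate, and a rare mode that ever misses a sample can never come back.

```python
import collections
import random

rng = random.Random(42)

# Generation 0: the "real" distribution, with a rare mode.
dist = {"common": 0.9, "rare": 0.1}
N = 50  # small per-generation sample size amplifies the drift

for _ in range(100):
    # Each generation is "trained" only on samples from the previous one.
    draws = rng.choices(list(dist), weights=list(dist.values()), k=N)
    counts = collections.Counter(draws)
    dist = {tok: n / N for tok, n in counts.items()}
    if "rare" not in dist:
        # Once the rare mode misses a sample, the next generation assigns
        # it probability zero and it is gone for good.
        break
```

The usual mitigations discussed are exactly what the sketch suggests: keep mixing in fresh human-generated data so that lost modes can be reintroduced, and monitor distribution drift between generations rather than just headline benchmark scores.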
That is where artificial intelligence is going: copying things from other things. Will there be an AI eureka moment where it deviates and knows where, and why, it is wrong?
It seems like if they did in fact distill, then what we have found is that you can create a worse copy of the model for ~$5M in compute by training on its outputs.
Better benchmark scores can be cooked
But if you leave someone in the tech industry of SV/SF long enough, they'll start to get high on their own supply and think they're entitled to insane amounts of value, so...
Look at the whole AI revolution that Meta and others have bootstrapped by opening their models. Meanwhile OpenAI/Microsoft, Anthropic, Google and the rest are just trying to look after number one while trying to regulatory-capture an "AI for me but not for thee" outcome of full control.
A case of thieves yelling 'stop those thieves', to me: they were just first and would not like losing that position. But it's all about money, and consequently power. Business as usual.
The comments were moved here by dang from a flagged article with an editorialized/clickbait title. That flagged post has 1300 points at the time of writing.
https://news.ycombinator.com/item?id=42865527
1. It should be incumbent on the moderator to at least consider that the motivation for the points and comments may have been because many thought the "hypocrisy" of OpenAI's position was a more important issue than OpenAI's actual claim of DeepSeek violating its ToS. Moving the comments to an article that buries the potential hypocrisy issue that may have driven the original points and comments is not ideal.
2. This article is from FT, which has a content license deal with OpenAI. Moving the comments to an article from a company that has a conflict of interest due to its commercial relations with the YC company in question is problematic here, especially since dang often states they try to be more hands-off on moderation when the article is about a YC company.
3. There is a link by dang to this thread from the original thread, but there should also be a link by dang to the original thread from here as well. Why is this not the case?
4. Ideally, dang should have asked for a more substantial submission that prioritized the hypocrisy point to better match the spirit of the original post instead of moving the comments to this article.
IANAL, but it is worth noting here that DeepSeek has explicitly consented to a license that doesn't allow them to do this. That is a condition of using ChatGPT and the OpenAI API.
Even if the courts affirm that there's a fair use defence for AI training, DeepSeek may still be in the wrong here, not because of copyright infringement, but because of a breach of contract.
I don't think OpenAI would have much of a problem if you train your model on data scraped from the internet, some of which incidentally ends up being generated by ChatGPT.
Compare this to training AI models on Kindle Books randomly scraped off the internet, versus making a Kindle account, agreeing to the Kindle ToS, buying some books, breaking Amazon's DRM and then training your AI on that. What DeepSeek did is more analogous to the latter than the former.
You actually don’t know this. Even if it were true that they used OpenAI outputs (and I’m very doubtful) it’s not necessary to sign an agreement with OpenAI to get API outputs. You simply acquire them from an intermediary, so that you have no contractual relationship with OpenAI to begin with.
I have some news for you
By existing in the USA, OpenAI consented to comply with copyright law, and how did that go?
OpenAI can't have it both ways
Like I’ve said time and time again, nobody in this space gives a fuck about anyone that isn’t directly contributing money to their bottom line at that particular instant. The fundamental idea is selfish, damages the fundamental machinery that makes the internet useful by penalizing people that actually make things, and will never, ever do anything for the greater good if it even stands a chance of reducing their standing in this ridiculously overhyped market. Giving people free access to what is for all intents and purposes a black box is not “open” anything, is no more free (as in speech) than Slack is, and all of this is obviously them selling a product at a huge loss to put competing media out of business and grab market share.
But in all reality I'm happy to see this day. The fact that OpenAI ripped off everyone and everything they could and, to this day, pretend like they didn't, is fantastic.
Sam Altman is a con, and it's not surprising that, given all the positive press DeepSeek got, there was a full-court assault on them within 48 hours.
But IANAL, so if you have a citation that says otherwise I'd be happy to see it!
I hope voters and governments put a long-overdue stop to this cancer of contract-maximalism that has given us such benefits as mandatory arbitration, anti-benchmarking, general circumvention of consumer rights, or, in this case, blatantly anti-competitive terms, by effectively banning reverse-engineering (i.e. examining how something works, i.e. mandating that we live in ignorance).
Because if they don't, laws will slowly become irrelevant, and our lives governed by one-sided contracts.
So no, it doesn't belong to OpenAI.
You might be able to sue for breach-of-contract penalties under the ToS, but that doesn't give them any right to the model. Nor does it give them any right to invalidate the unbound copyright grants they have given to third parties (here, literally everyone), prevent anyone from training their own new models based on it, or prevent anyone from using it. Oh, and the one breaching the ToS might not even have been the company behind DeepSeek but some in-between third party.
Naturally this is under a few assumptions:
- the US consistently applies its own law, but they have a long history of not doing so
- the US doesn't abuse its power to force its economic preferences (ban DeepSeek) on other countries
- it actually was trained on OpenAI outputs, but, uh, OpenAI has IMHO shown very clearly over the years that they can't be trusted and they are fully opaque. How do we trust their claim? How do we trust them not to have retrospectively tweaked their model to make it look as if DeepSeek copied it?
The US ruled that an AI cannot be the author; that does not mean, as so many clickbait articles suggest, that no AI products can be copyrighted.
One activist tried to get the US Copyright Office to acknowledge his LLM as the author, which would then provide him a license to the work.
There was no issue with him being the original author and copyright holder of the AI works. But that's not what was being challenged.
I'm wondering how Deepseek could have made 100s of millions of training queries to OpenAI and not one person at OpenAI caught on.
Now, DeepSeek may (or may not) have used some o1-generated data for the R1-Zero RL training, but if so that's just a cost saving versus having to source some reasoning data another way, and it in no way reduces the legitimacy of what they accomplished (which is not something any of the AI CEOs are saying).
OpenAI has also invested heavily in human annotation and RLHF. If all DeepSeek wanted was a proxy for scraped training data, they'd probably just scrape it themselves. Using existing RLHF'd models as replacement for expensive humans in the training loop is the real game changer for anyone trying to replicate these results.
That's like the mafia complaining that they worked so hard to steal those barrels of beer that someone made off with in the middle of the night and really that's not fair and won't someone do something about it?
Besides deals with insurance companies and governments, one of the ways they are still able to pull this off is by convincing everyone that it's too dangerous to play with this at home or buy it from an Asian supplier.
At least with software we had until now a way to build and run most things without requiring dedicated super expensive equipment. OpenAI pulled a big Pharma move but hopefully there will be enough disruptors to not let them continue it.
And if DeepSeek had a mole, why would they bother running a massive job internally to steal the data generated? It would be way easier for the mole to just leak the RL training process, and DeepSeek could quietly copy it rather than bothering with exfiltrating massive datasets to distill. The training process is most likely like, on the order of a hundred lines of Python or so, and you don't even need the file: you just need someone to describe it to you. Much simpler than snatching hundreds of gigabytes of training data off of internal servers...
Plus, the RL process described in DeepSeek's paper has already been replicated by a PhD student at Berkeley: https://x.com/karpathy/status/1884678601704169965 So, it seems pretty unlikely they simply distilled R1 and lied about it, or else how does their RL training algo actually... work?
This is mainly cope from OpenAI that their supposedly super duper advanced models got caught by China within a few months of release, for way cheaper than it cost OpenAI to train.
Someone has to correct me if I'm wrong, but I believe in ML research you always have a dataset and a model. They are distinct entities. It is plausible that output from OpenAI's model improved the quality of DeepSeek's dataset. Just like everyone publishing their code on GitHub improved the quality of OpenAI's dataset. What has been the thinking so far is that the dataset is not "part of" or "in" the model any more than the GPUs used to train the model are. It seems strange that that thinking should now change just because Chinese researchers did it better.
OpenAI has a message they need to tell investors right now: "DeepSeek only works because of our technology. Continue investing in us."
The choice of how they're wording that of course also tells you a lot about who they think they're talking to: namely, "the Chinese are unfairly abusing American companies" is a message that is very popular with the current billionaires and American administration.
The above OpenAI quote from the article leans heavily towards #1 and IMO not at all towards #2. The latter would be an extremely charitable reading of their statement.
It's going to shift the market for how foundation models are used. Companies creating models will be incentivized to vertically integrate, owning the full stack of model usage. Exposing powerful models via APIs just lets a competitor clone your work. In a way, OpenAI's Operator is a hint of what's to come.
Some may view this as partially true, given that o1 does not output its CoT process.
Whatever that means. The legal system right now is in shambles and flat-footed.
Knowing our current government leadership, I think we’re going to see some brute force action backed up by the United States military.
Even if they didn't directly, intentionally use o1 output (and they didn't claim they didn't, so far as I know), AI slop is everywhere. We passed peak original content years ago. Everything is tainted, and everything should be understood in that context.
In relative terms, that's obviously and most definitely true.
In absolute terms, that's obviously and most definitely false.
That's honestly such an academic point; who really cares?
They've been outcompeted, and the argument is 'well, if we didn't let people access our models, they would have taken longer to get here'. So what??
The only thing this gets them is an explanation of why training o1 cost them more than $5 million or whatever, but that is in the past: the datacentre has consumed the energy, and the money has gone up in fairly literal steam.
That being said, breaching OAI's systems, re-training a better model on top of their closed-source model, then open-sourcing it: that's more Robin Hood than villain, I'd say.
The Chinese Communist party very much sees itself in a global rivalry over "new productive forces". That's official policy. And US leadership basically agrees.
The US is playing dirty by essentially embargoing China over big AI - why wouldn't it occur to them to retaliate by playing dirtier?
I mean we probably won't know for sure, but it's much less far fetched than a lot of other speculation in this area.
E.g., R1's cold start training could probably have benefited quite a bit from having access to OpenAI's chain of thought data for training. The paper is a bit light on detail on how it was made.
Meanwhile, they have access to Meta models and Qwen. And Meta models are very easy to run and there's plenty of published work on them. Occam's Razor.
IMHO the whole world is becoming crazy for a lot of reasons, and pissing off billionaires makes me laugh.
Cheapening a series of fact-checkable innovations because of their country of origin, when so far all they have shown are signs of good faith, is paranoid at best and, at worst, propaganda to support the billionaire tech lords saving face for their own arrogance.
The word "our" does a lot of heavy lifting in politics[0]. America is not a commune, it's a country club, one which we used to own but have been bought out of, and whose new owners view us as moochers but can't actually kick us out (yet). It is in competition with another, worse country club that purports to be a commune. We owe neither country club our loyalty, so when one bloodies the other's nose, I smile.
[0] Some languages have a notion of an "exclusive we". If English had such a concept, this would be an exclusive our.