ClosedAI scraped human content without asking and explained why that was acceptable... but when the outputs of their own models are scraped, it is THEIR dataset and that is NOT acceptable!
Oh, the irony! :D
I shared a few screenshots of DeepSeek answering using ChatGPT's output in yesterday's article!
https://semking.com/deepseek-china-ai-model-breakthrough-sec...
The max token output is only 8K (32K thinking tokens). o1's is 128K, which is far more useful, and it doesn't get stuck like R1 does.
The hype around the DeepSeek release is insane and I’m starting to really doubt their numbers.
I've also compared o1 and (online-hosted) R1 on Qt/C++ code, being a KDE Plasma dev, and my impression so far is that the output is roughly on par. I've given both models some tricky tasks about dark corners of the meta-object system when crafting classes, etc., and they came up with generally the same sort of suggestions and implementations.
I do appreciate that "asking about gotchas with few definitive solutions, even if they require some perspective" and "rote day-to-day coding ops" are very different benchmarks due to how things are represented in the training data corpus, though.
For instance, Fireworks offers R1 with 164K/164K. They are far more expensive than DeepSeek, though.
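If you want to see which cap you're actually getting from a given host, here's a minimal sketch, assuming the OpenAI-compatible API these providers expose (the base URL and model name below are illustrative placeholders, so check the provider's docs):

    # Probe the completion-token ceiling of a hosted R1.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.deepseek.com",  # or the Fireworks endpoint
        api_key="YOUR_API_KEY",
    )

    resp = client.chat.completions.create(
        model="deepseek-reasoner",   # assumed model id for R1
        max_tokens=8000,             # requests above the host's cap get rejected or clamped
        messages=[{"role": "user", "content": "Write the longest essay you can."}],
    )

    print(resp.usage.completion_tokens)  # how many tokens you actually got back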
I mean, couldn't that be because they're just overwhelmed by users at the moment?
> And the output is very bad - it mashes together the header and cpp file
That sounds way worse, and not like something caused by being hugged to death, though.
Aider recently reported DeepSeek at the top of their benchmark[1], though, so I'm inclined to believe it isn't all hype.
https://en.wikipedia.org/wiki/Illegal_number
> An AACS encryption key (09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0) that came to prominence in May 2007 is an example of a number claimed to be a secret, and whose publication or inappropriate possession is claimed to be illegal in the United States.
https://www.federalregister.gov/documents/2023/11/01/2023-24...
>(k) The term “dual-use foundation model” means an AI model that is trained on broad data; generally uses self-supervision; contains at least tens of billions of parameters; is applicable across a wide range of contexts; and that exhibits, or could be easily modified to exhibit, high levels of performance at tasks that pose a serious risk to security, national economic security, national public health or safety, or any combination of those matters, such as by: ...
https://en.wikipedia.org/wiki/Export_of_cryptography_from_th...
Read the two following sections of my blog post:
1. "Distilled language models"
2. "DeepSeek: Less supervision"
> We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.
> Our primary fiduciary duty is to humanity. We anticipate needing to marshal substantial resources to fulfill our mission, but will always diligently act to minimize conflicts of interest among our employees and stakeholders that could compromise broad benefit.
> We will actively cooperate with other research and policy institutions; we seek to create a global community working together to address AGI’s global challenges.
I now view any moralistic statement by any of these big tech companies as complete and total bullshit, which is probably for the best, because that is what it is. These companies now exist solely to amass power and wealth. They will still use moralistic language to try to motivate their employees, but I hope folks still see it for the complete nonsense that it is.
[1] https://semking.com/wp-content/uploads/2025/01/DeepSeek-1024...
[2] https://www.bestbuy.com/site/help-topics/privacy-policy/pcmc...
https://www.linkedin.com/posts/organic-growth_deepseek-the-o...
I welcome friction, so I'll be blunt: I disagree with you, not because what you are saying is wrong but because you only consider systematic data collection.
That's not the issue here.
There's a difference between democracies like the United States or European countries, no matter how IMPERFECT they are, and a dictatorship that does not allow dissenting opinions.
There's a difference in how the data collected will be used.
Freedom of speech, even when it is relative, is better than totalitarianism.
Not that we could ever see what the NSA, CISA, ASIS, GCHQ, and other 3/4-letter agencies are actually doing with the collected data.
But they pinky promised to use it properly (or something), so, yay.
China considers industry to be completely subservient to government. Checks and balances are secondary to ideas like harmony and collective well-being.
>There's a difference in how the data collected will be used.
>Freedom of speech, even when it is relative, is better than totalitarianism.
I don't disagree with "democracy is better than totalitarianism", but what does that have to do with collecting device information and IP addresses? Is that excuse a cudgel you can use against any behavior that would otherwise be innocuous? It's fine to be against DeepSeek because you're concerned about them getting sensitive data via queries, or even that their models could be a backdoor for projecting Chinese soft power, but hand-wringing about device information and IP addresses is absurd. It makes as much sense as being concerned that the CCP/DeepSeek holds meetings, because even though every other company holds meetings, CCP/DeepSeek meetings could be used for totalitarianism.
This kind of person has a lot of cognitive dissonance going on.
Separately, I do think that now that the Chinese leadership has seen they have the chops to pull this off and then some, they are probably going to rein in future innovations; they'll likely demand that the big future discoveries remain closed-source (or even unannounced/unpublicized).
Edit: I am not defending OpenAI, and we are all enjoying the irony here. But it puts into perspective some of the wilder claims circulating that DeepSeek was able to somehow compete with OpenAI for only $5M, as if on a level playing field.
> Separately, I do think that now that the Chinese leadership has seen they have the chops to pull this off and then some, they are probably going to rein in future innovations; they'll likely demand that the big future discoveries remain closed-source (or even unannounced/unpublicized).
How do we know that this is not already happening with OpenAI/Meta and the U.S. government at some level? Power works the same way everywhere, whether we want it to or not. We don't have to pretend to be "better" all the time.
Depends on whether they want these tools to be adopted in the wider world. Rightly or wrongly there is a lot of suspicion in the West and an open source approach builds trust.
If the allegation is true (we don't know yet), then what you've written perfectly proves the point everyone is making. ChatGPT wouldn't be here if it weren't for all the research and work that preceded it in terms of tons of scrapable content being available on the Internet, and it's not like OpenAI invented transformers either.
Nobody is accusing DeepSeek of hacking into OpenAI's systems and stealing their content. OpenAI is just saying they scraped them in an "unauthorized" manner. The hypocrisy is laughably striking, but sadly nobody has any shame anymore in this world it seems. Play me the world's tiniest violin for OpenAI.
US AI folk were leading for two years by just throwing more and more compute at the same thing Google tossed them like a bone years ago (namely transformers). They made next to no innovation in any area other than how to connect more compute together. The idea of additional inference-time compute, looping the network back on its own outputs, which is the only significant conceptual advance of recent years, was something I, as a layman, came up with after a few days of thinking about why AI sucks and what could make it tackle problems that require iterative reasoning. They announced it a few weeks after I came up with the idea, so it was in the works for some time, but that shows how basic an idea it was. There was nothing else.
Then along comes a small company that introduced a few actual algorithmic advancements resulting in a 100x optimization (which is exactly what you'd expect from algorithmic optimizations), and big AI suddenly goes into full "dog ate my homework" mode, blaming everyone and everything around them.
And let's not forget: if the full outputs of their own models could enable training a better model at 1% of the cost, it puts them in an even worse light that they didn't do it themselves.
We have an apples-and-oranges thing here, which DeepSeek is intentionally leaning into. They get very cheap electricity and brag about their low costs, while OpenAI etc. typically brag about how expensive their training is. But it's all PR and lies.
The $5.5 million cost was quoted at $2/GPU-hour, which is a reasonable price for on-demand H100s that anyone in the US could access, and likely on the high side given bulk pricing and the fact that they were using nerfed versions. OpenAI might be all PR and lies, but everything I've seen so far says that DeepSeek's claims about cost are legit.
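The arithmetic is easy to sanity-check, too. A quick back-of-the-envelope sketch (the 2,048-GPU cluster size is the one DeepSeek's own report describes; everything else is just division):

    # Sanity-check the quoted training cost at the stated rental rate.
    price_per_gpu_hour = 2.0   # USD, the quoted rate for nerfed H100s (H800s)
    reported_cost = 5.5e6      # USD, covering the final training run only

    gpu_hours = reported_cost / price_per_gpu_hour
    print(f"{gpu_hours:,.0f} GPU-hours")    # 2,750,000 GPU-hours

    # Spread over the 2,048-GPU cluster DeepSeek describes:
    days = gpu_hours / 2048 / 24
    print(f"~{days:.0f} days of training")  # ~56 days

A couple of months on a two-thousand-GPU cluster is entirely plausible, which is why the number passes the smell test even if it excludes salaries, experiments, and failed runs.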
Plus, of course, for people within the tech bubble, there are plenty of research results on the value of synthetically augmented and expanded training data that put the impact well past just regurgitating source data.
This whole episode is, most of all, a failure of reporting: setting expectations, projecting running costs, and so on.
Sorry for the Short: https://www.youtube.com/shorts/M0QyOp7zqcY
Well, I guess they really helped make this a reality!
And then where does DeepSeek steal from next? Do they steal from themselves? Do they steal the stolen models they stole from the stolen data?
The AI Ponzi scheme...
It reminds me of 1984 in a sense. "Don’t you see that the whole aim of Newspeak is to narrow the range of thought? In the end we shall make thoughtcrime literally impossible, because there will be no words in which to express it."
Unlike 1984 I don't see this winnowing of new concepts as purposeful, but on the other hand I keep asking myself how we can be so stupid as to keep doing it.
1. scraping the internet and making AI out of it
2. using the AI from #1 to create another AI
are not the same thing.
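(For anyone wondering what #2 means mechanically: it's usually plain sequence-level distillation, i.e. harvesting a teacher model's outputs and reusing them as ordinary fine-tuning data. A minimal sketch, with placeholder model names and prompts, and no claim that this is DeepSeek's actual pipeline:)

    # Sequence-level distillation, sketched: collect (prompt, completion)
    # pairs from a teacher and write them out as supervised fine-tuning data.
    import json
    from openai import OpenAI

    teacher = OpenAI(api_key="TEACHER_KEY")  # any hosted "teacher" model

    prompts = ["Explain RAII in C++.", "Prove that sqrt(2) is irrational."]
    with open("distill_data.jsonl", "w") as f:
        for p in prompts:
            out = teacher.chat.completions.create(
                model="gpt-4o",  # placeholder teacher model id
                messages=[{"role": "user", "content": p}],
            )
            f.write(json.dumps({
                "prompt": p,
                "completion": out.choices[0].message.content,
            }) + "\n")

    # distill_data.jsonl then feeds a standard fine-tuning run of the student;
    # no weights, gradients, or other access to the teacher's internals involved.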
So, if you really, really care about ToS, then just never enter into a contract with OpenAI. Company A uses OpenAI to generate data and posts it on the open Internet. Company B scrapes the open Internet, including the data from Company A [2].
[1]: Ownership of content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.
[2]: This is not hypothetical. When ChatGPT was first released, several big AI labs accidentally (and not so accidentally) trained on the contents of the ShareGPT website (a site made for sharing ChatGPT outputs). ;)
#2 makes a big corp a bit angry
Indeed not the same thing
But arguably these actions share enough characteristics that it's reasonable to place them in the same category. Something like: "products that exist largely/solely because of the work of other people". The nonconsensual nature of this and the lack of compensation are what people understandably take issue with.
There is enough similarity that it evokes specific feelings about OpenAI when they suddenly find themselves on the other side of the situation.
It would be funny if OpenAI were to complain about this, but at least on Twitter I don't see that much whining about it from OpenAI employees. Sam publicly praised DeepSeek.
I do see some of them spreading the "they're hiding GPUs they got through sanction evasion" theory, which is disappointing, though.
You’re right. The second one is far more ethical. Especially when stealing from a thief.
Doesn’t Sam Altman keep parroting they’re developing AI “for the good of humanity”? Well then, someone taking their model and improving on it, making it open-source, having it consume less, and having a cheaper API, should make him delighted. Unless he *gasp* was full of shit the whole time. Who could have guessed?
“I don't want to live in a world where someone else makes the world a better place better than we do”
- Gavin Belson
#2 is taking advantage of ClosedAI.
they are indeed different
2. scraping the AI from #1 and making AI out of it
The best part is "their IP" was humanity's scraped content and they are angry that DeepSeek did their job for them and gave it away for free.
OpenAI is mad about d2 (not d1). I'm not sure using public data is "stealing". In summary, these are two different things and need to be kept separate.
It is unclear whether someone who breaks someone else's copyright to use a work A can claim copyright on a work B derived from A. My point is that OpenAI played loose with copyright rules to build its various models, so the legality of their claims against DeepSeek might not be so strong.
OpenAI asserts:
1. d2 was used by DeepSeek
2. All d2 belongs to OpenAI exclusively
Both are debatable for a large number of reasons.
I wonder if this is going to come to an end through a combination of social media fatigue, social media fragmentation, and open source LLMs just giving it all back to us for free. LLMs are analogous to a "JPEG for ideas" so they're just lossy compression blobs of human thought expressed through language.
It cannot die soon enough
They scraped literally all the content of the internet without permission. And I wouldn't even be surprised if they scraped the outputs of other LLMs as well.
You may not owe Altman better, but you owe this community better if you're participating in it.
I find your behavior repulsive and fervently wish you would quit.
https://blog.samaltman.com/trump
https://www.reddit.com/r/YAPms/comments/1i7ry5m/sam_altman_g...
Only a truly talented piece of shit can be as prolific as this.
"He is irresponsible in the way dictators are."
Chef's kiss.
Edit:
Kids, don't aspire to be like Altman. We as a community need to espouse more values than tech is gonna tech.
And don’t aspire to be like those who saw what he is but made peace with it in exchange for silver.
The first link is from mid-2016. The second link is from January 2025.
It is entirely reasonable for someone to genuinely change his or her views of a person over the course of 8.5 years. That is a substantial length of time in a person’s life.
To me a “flip-flop” is when one changes views on something in a very short amount of time.
"He is irresponsible in the way dictators are." - Sam Altman - 2016
"If you elect a reality TV star as President, you can't be surprised when you get a reality TV show" - Sam Altman - 2017
"When the future of the republic is at risk, the duty to the country and our values transcends the duty to your particular company and your stock price." - Sam Altman - 2017
"I think I started that a little bit earlier than other people, but at this point I am in really good company" - Sam Altman - 2017 ( On his criticism of Trump )
"Very few people realize just how much @reidhoffman did and spent to stop Trump from getting re-elected -- it seems reasonably likely to me that Trump would still be in office without his efforts. Thank you, Reid!" - Sam Altman - 2020A community only espouses good values when it punishes bad behavior. How do we do this when those misbehaving are very rich, and attempting to punish the misbehavior has negative consequences on you? There just aren't many available tools that don't require significant sacrifices.
The reason the flip-flops are so laughable to me is that they attempt to couch them in some noble, moralistic viewpoint, instead of the obvious reason: "We own big companies, the government has extreme power to make or break these companies, and everyone knows kissing up to Trump is what is required to be on his good side."
Profiles in Cowardice, every last one of them.
One of my most contrarian positions is I still like and support Altman, despite most of the internet now hating him almost as much as they (justifiably) hate Elon. Was a fan of Sam pre-YC presidency and still am now.
(I also am a big fan of DeepSeek and its CEO.)
Their failure is important, at a minimum, to the future of the United States, if not the world.
Society will always have crazy sociopaths destroying things for their own gain, and now it's Altman's turn.
I think DeepSeek’s strategy to announce a misleading low cost (just the final training run that optimizes a base model that in turn is possibly based on OpenAI) is also purposeful. After all, High Flyer, the parent company of DeepSeek, is a hedge fund - and I bet they took out big short positions on Nvidia before their recent announcements. The Chinese government, of course, benefits from a misleading number being announced broadly, causing doubt among investors who would otherwise continue to prop up American technology startups. Not to mention the big fall in American markets as a result.
I do think there's also a big difference between scraping the Internet for training data, which might just be fair use, and training off other LLMs or obtaining their assets in some other way. The latter feels like the kind of copying and industrial espionage that used to get China ridiculed in the 2000s and 2010s. Note that DeepSeek has never detailed their training data, even at a high level. This was true even in their previous papers, where they were very vague about the pre-training process, which feels suspicious.
Good for them! I hope this teaches Wall Street to not freak out about an unverified announcement.
Wall Street lost billions, and I hope they learned their lesson and next time will not crash the market when unverified news comes out.
Being a citizen of a Western nation, I'm inclined to agree with the general sentiment here, but how can you say this definitively? Neither you nor I know with any certainty what interference the US government has run against domestic LLMs, or what lies it has fabricated and cultivated that are now part of those LLMs' collective knowledge. We can see the perceived censorship in DeepSeek more clearly, but that isn't evidence that we're in any safer territory.
There are loads of examples on the internet of LLMs pushing (foreign) government narratives e.g. on Israel-Palestine.
Just because you might agree with the propaganda doesn't make it any less problematic.
These arguments always remind me of the arguments against Huawei because they _might_ be spying on western countries. On the other hand we had the US government working hand in hand with US corporations in proven spying operations against western allies for political and economic gain. So why should we choose an American supplier over a Chinese one?
> I think DeepSeek’s strategy to announce a misleading low cost (just the final training run that optimizes a base model that in turn is possibly based on OpenAI) is also purposeful. After all, High Flyer, the parent company of DeepSeek, is a hedge fund - and I bet they took out big short positions on Nvidia before their recent announcements. The Chinese government, of course, benefits from a misleading number being announced broadly, causing doubt among investors who would otherwise continue to prop up American technology startups. Not to mention the big fall in American markets as a result.
Why should I care about the stock value of US corporations?
> I do think there’s also a big difference between scraping the Internet for training data, which might just be fair use, and training off other LLMs or obtaining their assets in some other way.
So if training on copyrighted work scraped off the Internet is fair use, how would training on the LLMs' outputs not be fair use as well? You can't have it both ways.
Is corporate misinformation so much better? Recall of Tiananmen Square might be more honest, but if LLMs had been available over the past 50 years, I would expect many popular models would have cheerfully told us company towns are a great place to live, cigarettes are healthy, industrial pollution has no impact on your health, and anthropogenic climate change isn't real.
Especially after the recent behaviour of Meta, Twitter, and Amazon in open support of Trump and Republican interests, I'll be shocked if we don't start seeing that reflected in their LLMs over the next few years.
I had literally come to this post to say the same. You beat me to it.
The USA is going crazy over DeepSeek, and to me it just shows that the whole thing is a black swan, an AI bubble.
I am not saying AI has no use. I regularly use it to create things, but it's just not recommended. I am going to stop using AI, to grow my own mind.
And it's definitely way overpriced. People are investing so much money without seeing the returns, and I think people are also using AI out of a sense of FOMO. I don't know; to me it's funny.
I really, really want to create an index fund with strictly no AI companies, since the current market doesn't feel diversified enough. Sure, Nvidia gave a 25% return last year, but at this point it almost feels the same as bitcoin. The reason I don't/won't invest in bitcoin is that I don't want "that" risk.
This has been a boggling year.
I have realized that the world is crazy. Truly. Trump going from getting shot to winning, DeepSeek dragging down Nvidia and the American stock market (heck, even bitcoin!), Trump launching his meme coin... it's so crazy. If the world is crazy, just be the sane person around; you'll stick around. That's my philosophy. I won't jump on the AI bandwagon. But it's still absolutely wild, and a horror, to see how a "side project" (DeepSeek) put the American stock market in shambles.
I want more diversification. I am not satisfied with the current system. This feels like a bubble and I want no part in it.