Like, when a DeepSeek dev uses these systems as intended, would they also be seeing the columns, keys, etc. in English? Is there usually a translation step involved? Or do devs around the world just have to bite the bullet and learn enough English to be able to use the majority of tools?
I'm realizing now that I'm very ignorant when it comes to non-English-based software engineering.
That is precisely what happens. It is not unusual for code and databases to be written in English, even when the developers are from a non-English speaking country. Think about it: the toolchain, programming language and libraries are all based on English anyway.
However, a few years back it became common for most datasheets to be available in Mandarin and English, and this year most PCB fabrication houses have gained support for putting Chinese characters onto a circuit board (this requires higher-quality printing, since legibility demands finer detail).
Now there are a decent number of devices where the documentation is only available in Mandarin, and the design process was clearly done with little or no English involved.
Not everything changes though - gold plating thickness is still measured in micro-inches. Components often still use 0.1 inch pin spacing. Model numbers of Chinese chips are often closely linked to the Western chips they replace, the names of registers (in the CPU register sense) are often still English, etc.
For some kinds of software, localized names make a lot more sense, e.g. when you're dealing with very subtle distinctions between legal terms that don't have direct English equivalents.
There are some business concepts that are very unique to a place (country-specific or even company-specific) with no precise translation to the English-speaking world, and so I sometimes prefer to keep them in their native language.
There is some merit in asking your question, for there’s an unspoken rule (and a source of endless frustration) that business-/domain-related terms should remain in the language of their origin. Otherwise, (real-life story) "Leitungsauskunft" could end up being translated as "line information" or even "channel interface" ("pipeline inquiry" should be correct, it's a type of document you can procure from the [German] government).
Ironically, I’m currently working in an environment where we decided to translate such terms, and it hasn’t helped with understanding of the business logic at all. Furthermore, it adds an element of surprise and a topic for debate whenever somebody comes up with a "more accurate translation".
So if anything, English is a sign of a battle-hardened developer, until they try to convert proper names.
I did try my hand at a translation tool once the app was properly i18n'ed. Watched one guy blow coffee through his nose when I demoed it - the 'BACK' navigation button had been translated to the French word for a person's back, or something like that.
Depending on the company culture and policy, the most common thing to see is a mix of English variable and function names with native-language comments. Occasionally you will see native-language variable and function names. This is much more common in Latin character set languages (especially among Spanish and Portuguese speakers) in my experience; almost all Chinese code seems to use approximately-English variable and function names.
I'm a native English speaker, but from looking at various code bases written by people who aren't, I gather that it's basically this. It wasn't too long ago that one couldn't even reliably feed non-ASCII comments to a lot of compilers, let alone variable and function names.
Yes, that's what we did and do.
Depending on the project, I do use German variable names and comments at times, but I stopped using special characters like öüäß - they mess things up, even though in theory they should work just fine.
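The "in theory" part really is true - a quick Python 3 sketch (my own toy example, not from any real codebase):

```python
# Python 3 allows non-ASCII identifiers, so this is perfectly legal...
größe = 42
maße = {"breite": 3, "höhe": 14}

print(größe, maße["breite"])
# ...but editors, terminals, diff tools, and teammates' keyboards often
# disagree, which is why many of us stick to ASCII names anyway.
```

It runs fine under CPython; the trouble tends to come from the tooling around the code, not the language itself.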
Nowadays even Chrome dev tools come in German, but experience shows that translated programming tools (or any software, really) usually just have a partially translated UI. Any errors you encounter, or anything advanced, will be in English anyway. And if you google issues using your translated UI's wording, you won't find much, so better to just use the original version.
So English it is.
(And it is the lingua franca in most parts of the world anyway)
When I interact with it by asking it a question in Spanish, the parts between the <think> ... </think> are in English before it goes on to answer in Spanish.
Give it a try in your favourite language.
I went on to ask it if it "thinks" in English, Spanish or Chinese but it just gives the pat answer that, being an LLM, it doesn't think in any language.
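You can see the split yourself in the raw output - R1 wraps its reasoning in `<think>` tags before the visible answer. A rough sketch of pulling the two apart (the sample string below is made up, not a real API response):

```python
import re

# Made-up sample of R1-style raw output: English reasoning inside
# <think> tags, followed by the user-facing answer in Spanish.
raw = ("<think>The user asked in Spanish, so I should answer in Spanish."
       "</think>¡Claro! La capital de Perú es Lima.")

match = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
reasoning = match.group(1) if match else ""
answer = re.sub(r"<think>.*?</think>", "", raw, count=1, flags=re.DOTALL).strip()

print(reasoning)  # the English chain of thought
print(answer)     # the Spanish answer
```

Whatever language you prompt in, the part between the tags tends to come out in English.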
It also helps on the rare occasions some random notes evolve into a proper project that will have to be in English eventually anyway. There is no need for an extra translation step between initial idea and final product. All my vague hobby gamedev ideas are in English for instance.
That said, many developers might still prefer Turkish for naming DB tables, fields, variables, types and so forth if that’s the preference of the team. It wouldn’t be an exceptional situation. It’s quite easy too since Turkish also uses a Latin alphabet. May not be as easy or preferable in Chinese.
This way, you don't have to change keyboard layout while writing code.
Anyway, you're forced to learn some English when doing any real software development.
This makes some of their infra work and common misconceptions a little bit ... esoteric. So, English is crucial not just to do the job but to get best practices and CS info in general. It really helps a lot.
I mean we're kind of an outsourcing hub so it makes sense. Even some of our companies outsource further to the east so you really can't avoid it.
PS: I remember quite a while back when Wargaming's World of Tanks became a big thing they had to translate everything from Russian to English because they wanted to get foreign developers involved as well. Never heard of the reverse happening.
See also: aviation.
Since programming language keywords and APIs are written in English, it just looks better to keep identifiers and internal docs that way too; the other way causes a dissonance that feels uncomfortable to me.
Not only that. All of the code I (not a native English speaker) write, even if only I will ever see it, is in English - comments too. And I'm pretty confident all my colleagues do that too.
Might be different for languages with large population of native speakers (Croatian is just a few mil so we're more exposed to it), but you still can't avoid using English for tools / libs / docs / research papers / stack overflow...
That's how it goes, at least around Europe. People know English as a technical jargon (similar to legal French and Latin in English) and can juggle enough to get around documentation, but I've been in companies where I was the only fluent English speaker (and we're talking startup stuff). That gave me a bunch of cool opportunities though, being pulled into every other meeting as the designated translator.
I do occasionally find code with variable names in other languages, but it's very rare; for the most part, if you want to code, English is the way.
I've also seen a few devs who used Hebrew variable names but spelled in English (`shalom` instead of שלום).
English is the universal language in programming and software engineering, much like Latin was the universal scholarly language in the past. Sometimes even to the extent that the language starts leaking from the code and technical documents, reports, etc. are being written in English, often just because the people working close to the software are more familiar with the terminology in English than in their native language.
Even when you install e.g. Debian today and select Not-English as the system language, you might be surprised to see that GCC actually has i18n'd error messages, at least for some languages. Same for coreutils. I doubt anyone uses that intentionally, and they're probably not very up to date, but it does exist... kinda.
https://en.wikipedia.org/wiki/Non-English-based_programming_...
However, I suspect it's a honey pot.
Don't forget Shenzhen is a stone's throw away from Hong Kong where English is widely spoken.
Yes, coding in English is the standard.
It just makes things A LOT easier in terms of debugging, researching, reading examples from documentation, etc. I don't even understand my (boomer) colleagues who straight up refuse to learn English and get angry when they can't find solutions with German search terms.
That.
There's also a huge mental context-switching cost when you try to have code mixing, say, French and English together:

    size_t taille;
    size_t taille_domaine;

vs:

    size_t length;
    size_t domain_length;
Hardly anyone does the former. It's simply not a thing. I mean: sure, there are the odd projects that'll be exceptions. But we pretty much all name our functions/methods/variables/etc. and write our comments in English.
FWIW, when I code I actually both think and count in English.
- Dev infra, observability database (open telemetry spans)
- Logs of course contain chat data, because that's what happens with logging inevitably
The startling rocket-building prompt screenshot that was shared is meant to be shocking, of course, but it was most probably training data meant to prevent DeepSeek from completing such prompts, as evidenced by the `"finish_reason":"stop"` included in the span attributes.
Still pretty bad obviously and could have easily led to further compromise but I'm guessing Wiz wanted to ride the current media wave with this post instead of seeing how far they could take it. Glad to see it was disclosed and patched quickly.
As I understand it, a finish reason of "stop" in API responses usually means the model ended its output normally. In any case, I don't see how training data could end up in production logs, nor why they'd want to prevent such data (a prompt you'd expect a normal user to write) from being responded to.
> [...] I'm guessing Wiz wanted to ride the current media wave with this post instead of seeing how far they could take it.
Security researchers are often asked to not pursue findings further than confirming their existence. It can be unhelpful or mess things up accidentally. Since these researchers probably weren't invited to deeply test their systems, I think it's the polite way to go about it.
This mistake was totally amateur hour by DeepSeek, though. I'm not too into security stuff but if I were looking for something, the first thing I'd think to do is nmap the servers and see what's up with any interesting open ports. Wouldn't be surprised at all if others had found this too.
https://platform.openai.com/docs/api-reference/introduction
Right there in the docs:
> Now that you've generated your first chat completion, let's break down the response object. We can see the finish_reason is stop which means the API returned the full chat completion generated by the model without running into any limits.
Regarding how training data ends up in logs: it's not far-fetched to create a trace span to measure how long prompts + replies take, and it makes sense to record attributes like finish_reason for observability purposes. Including the message itself, however, is just amateur - but common nonetheless.
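As a rough sketch of that split (the attribute names here are my own invention, not DeepSeek's actual schema): record the metadata you need for latency and truncation debugging, and deliberately drop the message bodies.

```python
def span_attributes(prompt: str, response: dict) -> dict:
    """Build trace-span attributes for a chat completion call.

    finish_reason and token counts are genuinely useful for debugging
    truncation and latency; the prompt/completion TEXT is what leaked.
    """
    return {
        "llm.finish_reason": response["choices"][0]["finish_reason"],
        "llm.completion_tokens": response["usage"]["completion_tokens"],
        "llm.prompt_chars": len(prompt),  # size only, never the content
    }

# Fake response shaped like an OpenAI-compatible chat completion:
fake = {"choices": [{"finish_reason": "stop"}],
        "usage": {"completion_tokens": 42}}
attrs = span_attributes("how do I ...", fake)
print(attrs)
```

Same observability value, nothing sensitive to leak if the backing store is ever exposed.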
- Password authentication (bcrypt, SHA-256 hashes)
- Certificate authentication (fantastic for server-to-server communication)
- SSH key authentication (personally, this is my favourite - every database should have this authentication mechanism to make it easy for devs to work with)
Not very popular, but LDAP and an HTTP authentication server are also great options.
I also wonder how DeepSeek's engineers deployed their ClickHouse instance. When I deployed using yum/apt install, the installation step literally asks you to input a default password.
And if you set it up manually with the ClickHouse binary, the out-of-the-box config seals the instance off from external network access, and the default user is only exposed to localhost, as explained by Alex here - https://news.ycombinator.com/item?id=42871371#42873446.
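If you want to sanity-check your own deployment from an outside box, a crude sketch (replace the placeholder hostname with a host you own; 8123 and 9000 are ClickHouse's default HTTP and native-protocol ports):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unresolvable
        return False

# Only probe hosts you own! 8123 = ClickHouse HTTP, 9000 = native protocol.
for port in (8123, 9000):
    print(port, port_open("clickhouse.example.invalid", port))
```

If either port answers from the public internet and you didn't explicitly intend that, you have roughly the DeepSeek situation.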
forced us to use an alternative, and paywalling security features in an "open source" product didn't make us feel comfortable for a long-term investment like a db
https://github.com/ClickHouse/ClickHouse/pull/68634#issuecom...
If you do this and the company you're conducting your "research" on hasn't given you permission in some form, you can get yourself in a lot of hot water under the CFAA in the USA and other laws around the world.
Please don't follow this example. Sign up for a bug bounty program or work directly with a company to get permission before you probe and access their systems, and don't exceed the access granted.
> The Wiz Research team immediately and responsibly disclosed the issue to DeepSeek, which promptly secured the exposure
I did the same a while ago, an education platform startup had their web server misconfigured, I could clone their repo locally because .git was accessible. I immediately sent them an email from a throwaway account in case they wanted to get me in trouble and informed them about the configuration issues. They thanked me for the warning and suggestions, and even said they could get me a job at their company.
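Checking for that particular misconfiguration is trivial - a rough sketch (the function is mine, not any standard tool, and per the comments above, only run it against hosts you own or have permission to test):

```python
import urllib.request
import urllib.error

def git_dir_exposed(base_url: str) -> bool:
    """Return True if <base_url>/.git/HEAD is publicly readable -
    usually enough to reconstruct the entire repository remotely."""
    url = base_url.rstrip("/") + "/.git/HEAD"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            # A real HEAD file starts with a ref line like "ref: refs/heads/main"
            return resp.status == 200 and resp.read().startswith(b"ref:")
    except (urllib.error.URLError, OSError):
        return False
```

The usual fix is to deny web access to dot-directories entirely in the server config.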
Wiz folks are notoriously shady. They cross the line a ton. They did this to Amazon and Microsoft to make a name for themselves, among others. Super unethical.
Their product isn't terrible, but their salespeople are just terrible. Completely off-putting. Most of them are idiots from Zscaler.
Likewise, there may be Chinese laws that were violated. However, outside of China they are a moot point.
DeepSeek & users that had data exposed here should be thanking Wiz.
They are getting DDoS'd by the US gov too, so they were only trying to help /s
Complete database control and potential privilege escalation within the DeepSeek environment without ANY authentication...
For most people, bash is not a tool for interacting with the computer, it is how they express their frustration with the computer (sometimes leaving damaged keyboards).
There you have it: the real face of Big Tech. Extinguishing the competition by locking a service behind a portal provided for free, then starting to milk the users, is not enough for them... they will also fight dirty, really dirty.
But it's also one that those of us actually working on foundational AI saw coming a mile away, given that most of the top research of late has been happening in Chinese labs, not American or European ones.
Can't wait to see what this boneheaded President's tariffs on TSMC do to this situation.
I don't understand the rage. This is good for everyone. Competition is what drives innovation and they even open sourced it! If you want to outdo them, learn from them. Don't just try to cry louder, it's embarrassing for everyone.
Can you please provide a source? Genuinely curious as this would be fatal to the US economy. Imagine working 2 years to get out from Covid chip shortages only to hammer progress down with tariffs.
NVIDIA's stock has been super bubbly—all DeepSeek did was set off itchy investor trigger fingers that were already worried about its highly inflated price.
DeepSeek uses H100s and H800s. They'll likely have reasons to buy more now, and America will want to compete even harder, buying more chips.
American companies are still way ahead as well, but they're just getting more competition. This will be healthy.
Tesla barely even sells but the stock just won't go down. Boeing orders have fallen massively and they're posting massive losses each quarter, and management shows zero desire to improve the situation. But the stock has basically stabilized since the initial catastrophes.
a) For inference, cheaper and faster compute will increase total inference spend, because the end-user products will work better and people will use them more.
b) For training, the big labs will continue to spend because we have yet to see diminishing returns to scale - in fact, in the past year we unlocked a new dimension of scaling: doing more RL after pre-training to improve reasoning capabilities. Since current SOTA models are not yet smart enough for all the tasks people want to use them for, any efficiency gains will be used to further improve performance. In the current competitive environment, even with DeepSeek's work, it's near-impossible to imagine OpenAI, Anthropic, Google, or Meta deciding to cut the compute budget for their next model by an order of magnitude. They will incorporate DeepSeek's techniques, but use them to squeeze even more performance out of the compute they have, and will keep purchasing as much compute as Nvidia will sell them. Expect this trend to continue until returns to scale finally run out.
https://youtubetranscriptoptimizer.com/blog/05_the_short_cas...
Personally, I know I've lost a lot of street cred in certain work circles recently over my view that shops should pursue local LLM solutions [0], and the '$6000, 4-8 tokens/second local LLM box' post making the rounds [1] hopefully gives orgs a better idea of what LLMs can do if we keep them from being 100% SaaS-like in structure.
I think a big litmus test for some orgs in near future, is whether they keep 'buying ChatGPT' or instead find a good way to quickly customize or at least properly deploy such models.
[0] - I mean, for starters, a locally hosted LLM resolves a LOT of concerns around infosec....
[1] - Very thankful a colleague shared that with me...
"This is good for Nvidia" is the 2025 version of "this is good for bitcoin"
A lot of people want to poke at Chinese weakness wherever it's exposed, because Americans are used to being the best - and also because of unconscious racism. When Japan was about to overtake the US, the US pulled some similar moves, and that is partly what's responsible for Japan's current economic funk. It's unlikely these moves will work on China.
Obviously I am disappointed about this, but people really blow it out of proportion IMO. Rumor is this was a side project for some employees at a hedge fund. You think they specialize in security and software application best practices? I'm not exactly surprised that it's insecure.
The really crazy thing is that anyone gives ANY company sensitive data to train on, regardless of which country the service is running in. That's what's actually crazy.
This whole thing should be an eye-opener to most people.
To ask an honest question, who gives a crap if a Chinese company manages to grab data that many of the usual Silicon Valley suspects have had all along and have been incrementally updating? How is this a "threat".
To pile on another gripe, why the hell does every single media outlet point out the "Tiananmen Square" question?
The whole thing has just become embarrassing. I honestly can't fathom what worse China could do with my personal info than the likes of say, Meta. I'm not saying I would enjoy it, but I just don't see how it could be more harmful than the Silicon Valley status quo.
Given how closely major US tech companies are now affiliated and partnered with the US Federal government, arguably the direct potential threat from them to US citizens may well be higher than from across a very big pond.
People trot out "I'd rather our guys spy on me than them" a lot, but that's putting a lot of faith in your local government. Conversely, who do you think has more to fear from their logged prompts on DeepSeek, US or Chinese citizens?
It is a threat to WallStreet and Silicon Valley. It just broke the illusion that they're kings of tech.
> why the hell does every single media outlet point out the "Tiananmen Square" question?
Sour grapes, but also the media cannot report anything about China without showing its anti-China bias.
No it isn't (well, it probably is too). This is the rather naff nation-state bollocks in play.
You have either or both of "some bigger boys found a more efficient way of doing something I thought I was good at" and "I've wet myself".
Seems like the kind of mistake you would make if you are not used to deploying external client facing applications.
[0] https://www.spiegel.de/international/business/we-know-where-...
https://news.ycombinator.com/item?id=41678840
The joke is these companies build systems that can tell them how to implement better security, they simply don't care.
I doubt very much that it was only that, and not massively backed by the Chinese state in general.
As with OpenAI, much of this has to do with hype based speculation.
In OpenAI's case, they played along with the speculation that they might already have AGI locked up in their labs, and fueled it. The result: massive investment (now in danger).
And China and the US are playing a game of global hegemony. I just read articles with the essence of: see, China is so great that a small side project there can take down the leading players from the West! Come join them.
It is mere propaganda to me.
Now, DeepSeek in the open is a good thing, but I believe the Chinese state is backing it massively to help with that success and to help shake off Western dominance. I would also assume the Chinese intelligence services helped directly, with intel straight out of the labs of OpenAI and co.
This is about real power.
Many states are about to decide which side they should take, if they have to choose between West and East. Stuff like this heavily influences those decisions.
(But btw. most don't want to have to choose)
Alibaba has Qwen. Baidu, Huawei, Tencent, etc all have their own AI models. The Chinese government would most likely push one of these forward with their backing, not an unknown small company.
    # Please install OpenAI SDK first: `pip3 install openai`
    from openai import OpenAI

    client = OpenAI(api_key="<DeepSeek API Key>", base_url="https://api.deepseek.com")

The client-facing aspect isn't the problem here. This linked article is talking about the backend having vulnerabilities, not the client-facing application. It's about a database that is accessible from the internet, with no authentication, with unencrypted data sitting in it. High-Flyer, the parent company of DeepSeek, already has a lot of backend experience, since that is a core part of the technology they've built to operate the fund. If you're a quantitative hedge fund, you aren't just going to be lazy about your backend systems and data security. They have plenty of experience and capability to manage those backend systems just fine.
I’m not saying other companies are perfect either. There’s a long list of American companies that violate user privacy, or have bad security that then gets exploited by (often Chinese or Russian) hackers. But encrypting data in a database seems really basic, and requiring authentication on a database also seems really basic. It would be one thing if exposure of sensitive info required some complicated approach. But this degree of failure raises lots of questions whether such companies can ever be trusted.
You're reciting a bunch of absolute numbers, without any sort of context at all. $5M isn't the same for every company. For example, in 2020, it seems High Flyer spent a casual $27M on a supercomputer. They later replaced that with a $138M new computer. $5.5M sounds like something that could be like a side-project for a company like that, whose blood and sweat is literally money.
> But this degree of failure raises lots of questions whether such companies can ever be trusted.
This, I agree with though. I wouldn't trust sending my data over to them. Using their LLMs though, on my own hardware? Don't mind if I do, as long as it's better, I don't really mind what country it is imported from.
This is not fair. Is OpenAI, for example, including the CEO's paycheck in its model training costs?
Zero evidence that the above statement is true, and weak evidence (authors' claims) that it is false. Have you read their papers even?
https://arxiv.org/html/2412.19437v1#abstract https://arxiv.org/pdf/2501.12948
I'm sure you were just misled by all the people, including Anthropic's Dario, parroting this claim, but even Dario has already said he was wrong to say that, and SemiAnalysis has already clarified it was a misunderstanding of their claim, which was 50,000 H-series GPUs, not 50,000 H100s.
But the software? Absolute disaster.
When people say DeepSeek is a side project, this is what I assume they mean. It's one thing when a bunch of software engineers make something with terrible security - software is their main job. With a bunch of academics (and no offense to academics), software is not their main job; you could be spending your time teaching them how to use version control.
Can we stop with this nonsense ?
The list of author of the paper is public, you can just go look it up. There are ~130 people on the ML team, they have regular ML background just like you would find at any other large ML labs.
Their infra costs multiple millions of dollars per month to run, and the salary of such a big team is somewhere in the $20-50M per year range (not very au fait with market rates in China, hence the spread).
This is not a sideproject.
Edit: Apparently my comment is confusing some people. I am not arguing that ML people are good at security, just that DS is not the side project of a bunch of quant bros.
So maybe not a side project, but if you have ever worked with ML researchers before, lack of engineering/security chops shouldn't be that surprising to you.
OP means the public API and app being a side project, which likely they are; the skills required to do ML have little overlap with the skills required to run large, complex workloads securely and at scale for a public-facing app with presumably millions of users.
The latter role also typically requires experience, not just knowledge, to do well, which is why experienced SREs have very good salaries.
Data breaches from unsecured or accidentally-public servers/databases are not unusual among much larger entities than DeepSeek.
We're not talking some poor college students here.
Not only that, this was a "production-grade" database with millions of users, the app was #1 on the App Store, and ALL text sent in the prompts was logged in plain text?
Unbelievable.
[1] https://www.reuters.com/technology/cybersecurity/openais-int...
A large-scale DDoS is being directed against DeepSeek.
US big tech wants to quash the competition.
If you're releasing a major project into the wild, expecting serious attention, and have the money, you get third parties involved to test for these things before you launch.
Now can we get back to discussing the real conspiracy theories. This is clearly a disinformation piece by BigAI to add FUD around the Chinese challenger :-)
No one is here as far as I can tell. But if you've ever been a software engineer who is required to work with someone purely from an ML lab and/or academia, you'll quickly discover that "principled software engineering" just isn't really something they consider an important facet of software. This is partly due to culture in academia, general inexperience (in the software industry) and deeply complicated/mathematical code really only needing to be read by other researchers who already "get it", to a degree.
Not an excuse but rather an explanation for _why_ such an otherwise impressive team might make a mistake like that.
I haven't worked with serious ML engineers, but having worked in large webdev, there's usually a team involved in these projects, including senior non-devs who would ensure the correct checks and balances are in place before go-live. Does this not happen in ML projects? (Of course there are always exceptions and unknowns that will slip through; I don't know if that was the case here, or something else.)
I suggest you guys don't do that either.
This industry in China is so young; many devs and orgs don't understand what will happen if they shut down the firewall or expose their database on the internet without a password.
They just can't think of it - someone needs to remind them.
I would recommend that. Bitwarden is a pretty good open-source password manager. You can install it as a plugin in your browser, so it can fill out your password for you so you don't have to manually copy and paste.
You can fault them for disclosure practices though :-)
>The Wiz Research team immediately and responsibly disclosed the issue to DeepSeek, which promptly secured the exposure.
It seems like Wiz told deepseek and deepseek secured this vuln?
We all agree this kind of leak should be disclosed. However, normally security researchers don't just leak the specific URLs and such. This may be what the parent is referring to.
If your information is sensitive, do not use an LLM by public API - absolutely all of your data is being stored and processed. For all of them.
Downvoted - because of course the CCP wouldn't want all of this data, that's preposterous. What would they even do with it? /S
Kinda like how your comment was grey within 1 minute, despite stating an objective truth.
Sure, this is to be expected given the billions and billions of dollars at stake but like - that money is gone lol. DeepSeek isn't going back in the bottle, nor is open source AI in general.
The direct disclosure of URLs and ports is insane. Wonder if they would be as irresponsible if it were MSFT, OpenAI, Anthropic, etc.
PS: Not defending DeepSeek for bad practices, but still. Nothing irresponsible here.
PS2: It is marked as resolved, I went directly to the vulns due to the title of the post.
ed: I was wrong!
> The Wiz Research team immediately and responsibly disclosed the issue to DeepSeek, which promptly secured the exposure.
Assuming everything mentioned in the article was fixed before publication, I don’t see an issue with it.