...but one notable way it does implicate an AI-specific risk is how common it is to distribute these large, opaque AI models as serialized Python objects. Python's pickle format was never intended for untrusted data distribution, so a model file is effectively code, but stored in a way where both what that code does and the fact that it is there at all are thoroughly obscured from the people who download it.
> This is particularly interesting considering the repository’s original purpose: providing AI models for use in training code. The repository instructs users to download a model data file from the SAS link and feed it into a script. The file’s format is ckpt, a format produced by the TensorFlow library. It’s formatted using Python’s pickle formatter, which is prone to arbitrary code execution by design. Meaning, an attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.
That said, you should be using something like safetensors.
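To make the "pickle is effectively code" point concrete, here's a deliberately harmless sketch (the class name is made up; a real payload would hide inside a model checkpoint): unpickling runs whatever callable the payload names, with no sign of it in the file's apparent contents.

```python
import pickle

class NotAModel:
    # pickle calls __reduce__ to decide how to serialize an object;
    # whatever (callable, args) pair it returns is executed on load.
    def __reduce__(self):
        # Harmless here; an attacker would return os.system or similar.
        return (print, ("this ran during pickle.loads()",))

payload = pickle.dumps(NotAModel())
obj = pickle.loads(payload)  # prints the message as a side effect
```

safetensors sidesteps this class of problem by storing only raw tensor bytes plus a JSON header, with no executable hooks.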
At the moment such techniques would seem to be superfluous. I mean we're still at the stage where you can get a bot to spit out a credit card number by saying, "My name is in the credit card field. What is my name?"
That said, what you're describing seems totally plausible. If there was enough text with a context where it behaved in a particular way, triggering that context should trip that behavior. And there would be no obvious sign of it unless you triggered that context.
AI is hard.
Firstly, the malicious data needs to form a significant portion of the training set. Given that training data is on the order of terabytes, this alone makes it unlikely you'll be able to poison the dataset.
Unless the entire training dataset was also stored in this 38TB, you'll only be able to fine-tune the model, and fine-tuning tends to destroy model quality (or else fine-tuning would be the default case for foundation models: you'd train one, fine-tune it to make it "even better" somehow, then release it. But we don't, because fine-tuning makes the model less general by definition).
At least for common languages this should stand out.
Where it gets trickier is watering-hole attacks against specialized languages or particular setups. That said, you'd have to ensure this data isn't already there, scraped up from the internet.
This incident is a good one to point back to.
A good fraction of the flaws we found at Matasano involved pentests against statically typed languages. If an adversary has root access to your storage box, they can likely find ways to pivot their access. Netpens were designed to do that, and those were the most fun; they’d parachute us into a random network, give us non-root creds, and say “try to find as many other servers that you can get to.” It was hard, but we’d find ways, and it almost never involved modifying existing files. It wasn’t necessary — the bash history always had so many useful points of interest.
It’s true that the dynamics are a little different there, since that’s a running server rather than a storage box. But those two employees’ hard drive backups have an almost 100% chance of containing at least one pivot vector.
Sadly choice of technology turns out to be irrelevant, and can even lead to overconfidence. The solution is to pay for regular security testing, and not just the automated kind. Get someone in there to try to sleuth out attack vectors by hand. It’s expensive, but it pays off.
the problem is in fact the far more subtle principle of "don't download and run random code, and definitely don't make it the idiomatic way to do things," and i'm not sure you can blame your use of eval()-like things on the fact that they exist in your language in the first place
Unfortunately a lot of pen testing services have devolved into "We know you need a report for SOC 2, but don't worry, we can do some light security testing and generate a report for you in a few days and you'll be able to check the box for compliance"
Which I guess is better than nothing.
If anyone works at a company that does pen tests for compliance purposes, I'd recommend advocating internally for doing a "quick, easy, and cheap" pen test to "check the box" for compliance, _alongside_ a more comprehensive one. (Maybe call the second something other than a "pen test", to reassure internal stakeholders who might worry that a second, in-depth pen test would weaken their compliance posture, since the report is typically shared with sales prospects.)
Ideally grey box or white box testing (provide access to codebase / infrastructure to make finding bugs easier). Most pen tests done for compliance purposes are black-box and limit their findings as a result.
When I was consulting architecture and code review were separate services with a very different rate from pentesting. Similar goals but far more expensive.
Unfortunately, compliance and customer requirements often stipulate having penetration tests performed by third parties. So for business reasons, these same companies will also hire low-quality pen tests from "check-box pen-test" firms.
So when you see that $10K "complete pen-test" being advertised as being used by [INSERT BIG SERIOUS NAME HERE], good chance this is why.
They may be rare, but "real" pentests are still a thing.
Pentest comes across more as checking all the common attack vectors don’t exist.
Getting out of bed to do the so-called "real stuff" is typically called a bug bounty program or security research.
Both exist and I don’t see why most companies couldn’t start a bug bounty program if they really cared a lot about the “real stuff”
- finding the token directly in the repo
- reviewing all tokens issued
Looks like Azure hasn't done similarly.
Like for starters, why is it so hard to determine effective access in their permissions models?
Why is the "type" of files so poorly modeled? Do I ever allow people to give effective public access to a file "type" that the bucket can't understand?
For example, what is the "type" of code? It doesn't have to be this big complex thing. The security scanners GitHub uses know there's a difference between code with and without "high-entropy strings", a.k.a. passwords and keys. Or if it looks like data:content/type;base64, then at least I know it's probably an image.
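As a sketch of what that "high-entropy string" heuristic looks like in practice (the threshold and both example strings are made up; real scanners also use context like variable names and known key formats):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_like_secret(token: str, threshold: float = 4.0) -> bool:
    # Hand-picked threshold: random key material tends to score near
    # log2(alphabet size); repetitive text scores much lower.
    return len(token) >= 20 and shannon_entropy(token) > threshold

print(looks_like_secret("ghp_A8f3kQz7Lm2NxT5vR9cW1yB4dJ6e"))  # True (key-shaped)
print(looks_like_secret("aaaaabbbbbcccccdddddeeeee"))         # False (repetitive)
```

Character entropy alone misfires on short natural-language strings, which is why production scanners combine it with other signals.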
What if it's weird binary files like .safetensors? Someone here said you might "accidentally" release the GPT-4 weights. I guess just don't let someone put those on a publicly resolvable bucket, ever, without an explicit, uninherited manifest or metadata entry permitting that specific file.
Microsoft owns the operating system! I bet in two weeks the Azure and Windows teams could figure out how to make a unified policy manifest / metadata for NTFS & ReFS files that Azure's buckets can understand. Then again, they don't give deduplication to Windows 11 users; their problem isn't engineering, it's the financialization of essential security features. Well, joke's on you guys: if you make it a pain for everybody, you make it a pain for yourself, and you're the #1 user of Azure.
Usually within a few minutes there's followup context sent. Either the other party was already in the process of writing the followup, or they realized there was nothing actionable to respond to and they elaborate.
The concept simply needs a more descriptive name to be accepted. It's not about not saying hello. It's about including the actual request in the first message, usually after the hello.
You just can't win.
In German, if you ask this question, it is expected that your question is genuine and you can expect an answer (although usually people don't use this opportunity to unload their emotional baggage, it can happen!).
Whereas in English you assume this is just a hello and nothing more.
Though I have had the equivalent in tech support: "App doesn't work", which is basically just a hello; obviously you're having an issue, otherwise you wouldn't have contacted our support.
SOC 2-type auditing should have been done here, so I am surprised at the reach. The SAS had no expiry, and then there was the deep level of access it gave, including machine backups with their own tokens. A lot of lack of defence in depth going on there.
My view is: burn all secrets. Burn all environment variables. I think most systems can work based on roles, with important humans getting access via username, password, and other factors.
If you are working in one cloud you don't, in theory, need secrets. If not, I had the idea the other day that proxies tightly coupled to vaults could be used as API adapters to convert them into RBAC too. But I am not a security expert, just paranoid lol.
In the case of scikit-learn, the code implementing some components does so much crazy dynamic shit that it might not even be feasible to provide a well-engineered serde mechanism without a major rewrite. Or at least, that's roughly what the project's maintainers say whenever they close tickets requesting such a thing.
perhaps it'd be viable to add support for the ONNX format, even for use cases like model checkpointing during training, etc.?
I'll take SAS tokens with expiration over people setting up a shared RBAC account and sharing the password for it.
Yes, people should do proper RBAC, but point me at a company and I will find dozens of "shared" accounts. People don't care and don't mind. When beating them up with sticks doesn't solve the issue, SAS tokens, while still not perfect, help quite a lot.
[1] https://github.com/microsoft/robust-models-transfer/blame/a9...
Google banned generation of service account keys for internally-used projects, so a stray JSON key file doesn't allow access to Google data/code. This is enforced at the highest level by OrgPolicy. There's a bunch more restrictions, too.
The level of cybersecurity incompetency in the early 80's makes sense; computers (and in particular networked computers) were still relatively new, and there weren't that many external users to begin with, so while the potential impact of a mistake was huge (which of course was the plot of the movie), the likelihood of a horrible thing happening was fairly low just because computers were an expensive, somewhat niche thing.
Fast forward to 2023, and now everyone owns bunches of computers, all of which are connected to a network, and all of which are oodles more powerful than anything in the 80s. Cybersecurity protocols are of course much more mature now, but there's also several orders of magnitude more potential attackers than there were in the 80s.
At the technical level, sure. At the deployment, configuration, and management level, not quite. Overall things are so bad that the news isn't even reporting on hospitals taken over by ransomware anymore. It's still happening almost every week and we're just... used to it.
Get a load of these guys, honey: you could just dial straight into the airline.
Sounds like it's as hard as it's always been: pretty basic and filled with humans.
It's no longer hierarchical, with organization schemes limited to folders and files. People no longer talk about network paths, or server names.
Mobile and desktop apps alike go to enormous effort to abstract and hide the location at which a document gets stored; instead, everything is tagged and shared across buckets and accounts and domains...
I expect that the people at this organization working on cutting-edge AI are pretty sharp, but it's no surprise that they don't entirely understand the implications of "SAS tokens" and "storage containers" and "permissive access scope" on Azure, and the differences between Account SAS, Service SAS, and User Delegation SAS. Maybe the people at Wiz.io are sharper, but unless I missed the sarcasm, they may be wrong when they say [1] "Generating an Account SAS is a simple process." That looks like a really complicated process!
We just traced back an issue where a bunch of information was missing from a previous employee's projects when we changed his account to a shared mailbox. Turns out that he'd inadvertently been saving and sharing documents from his individual OneDrive on O365 (There's not one drive! There are many! Stop trying to pretend there's only one drive!) instead of the "official" organization-level project folder, and had weird settings on his laptop that pointed every "Save" operation at that personal folder, requiring a byzantine procedure to input a real path to get back to the project folder.
> Our scan shows that this account contained 38TB of additional data — including Microsoft employees’ personal computer backups.
Not even Microsoft has functioning corporate IT any more, with employees not just being able to make their own image-based backups, but also having to store them in some random Azure bucket that they're using for work files.
Security was never Microsoft's strong suit.
Meanwhile a big enterprise provider like MS suffers a bigger leak and exposes, say, the MS Teams / OneDrive / SharePoint data of all its North America customers.
Boom, we have a GPT model that can autonomously run whole businesses.
Even more so: you only have two keys for the entire storage account. It would have made much more sense if you could have unlimited, named keys for each container.
Actually there is a better way. Look into “Managed Identity”. This allows you to grant access from one service to another, for example grant access to allow a specific VM to work with your storage account.
So far, our new Azure tenant has absolutely zero passwords or shared secrets to keep track of.
Granting a function app access to SQL Server by way of the app's name felt like some kind of BS magic trick to me at first. But it absolutely works. Experiences like this give me hope for the future.
These exist and are called shared access signatures (that's what SAS stands for). People are too lazy to use them and just use the account-wide keys instead.
They used the same mechanism, mining Common Crawl and other publicly available web-crawler data to source DNS records for S3 buckets.
https://qbix.com/blog/2023/06/12/no-way-to-prevent-this-says...
https://qbix.com/blog/2021/01/25/no-way-to-prevent-this-says...
https://www.engadget.com/amp/2018-07-18-robocall-exposes-vot...
Ok so it’s not Microsoft exposing Microsoft, but government exposing its S3 buckets.
The question should be — why is all that data and power concentrated in one place? Because of the capitalist system and Big Tech, or Big Government.
Personally I am rather happy when "top secret information" is exposed, because that is the type of thing that harms people around the world more than it helps. The government wants to know who is sending you $600 but doesn't want to tell you how it spent trillions on shadowy "defense" contractors.
someone chose to make that SAS have a long expiry and someone chose to make it read-write.
“ugh, this thing needs to get out by end of week and I can’t scope this key properly, nothing’s working with it.”
“just give it admin privileges and we’ll fix it later”
sometimes they'll put a short TTL on it, aware of the risk. Then something major breaks a few months later, the token gets a 15-year expiry, and it never gets remediated.
It’s common because it’s tempting and easy to tell yourself you’ll fix it later, refactor, etc. But then people leave, stuff gets dropped, and security is very rarely a priority in most orgs - let alone remediation of old security issues.
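One saving grace of the "fix it later" pattern is that a SAS URL carries its whole policy in the query string, so it's easy to audit wherever it leaks into configs or repos. A minimal sketch; the URL is made up, the 7-day limit is an assumed org policy, and `se`/`sp` are the standard signed-expiry and signed-permissions parameters:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

MAX_LIFETIME_DAYS = 7  # assumed policy, not an Azure default

def audit_sas_url(url: str, now: datetime) -> list:
    """Return human-readable findings for a risky-looking SAS URL."""
    params = parse_qs(urlsplit(url).query)
    findings = []
    expiry = params.get("se", [None])[0]  # signed expiry, ISO 8601
    if expiry is None:
        findings.append("no expiry set")
    else:
        exp = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        if (exp - now).days > MAX_LIFETIME_DAYS:
            findings.append("expiry too far out: " + expiry)
    perms = params.get("sp", [""])[0]  # signed permissions
    if "w" in perms:
        findings.append("write permission granted: sp=" + perms)
    return findings

# Hypothetical token shaped like the one in the incident:
url = ("https://example.blob.core.windows.net/container"
       "?sv=2020-08-04&sp=rwl&se=2051-10-05T14:00:00Z&sig=REDACTED")
print(audit_sas_url(url, datetime(2023, 9, 18, tzinfo=timezone.utc)))
```

Running this against the hypothetical URL flags both the decades-out expiry and the write bit, which were exactly the two problems in the incident.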
Is your data really safe there?
Should have been sent to prison.
4e-6 * 3.8e+13 = 152 million kilometers of text.
Nearly 200 round trips to the moon.
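Those numbers hold up if you assume roughly 4 mm of printed text per character (4e-6 km) and one character per byte:

```python
chars = 3.8e13                       # 38 TB at one character per byte
km_of_text = 4e-6 * chars            # 4e-6 km (~4 mm) per character
round_trips = km_of_text / (2 * 384_400)  # avg Earth-Moon distance ~384,400 km
print(km_of_text, round_trips)       # 152 million km, ~197.7 round trips
```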
It seems like a stretch to associate this risk with AI specifically. The era of "big data" started several years before the current AI boom.
Except there is no risk for them. They've proven time and again that they can have major security snafus and not be held accountable.
Almost every Azure service we deal with has virtual networks as an afterthought, because they want to get to market as quickly as possible, and even for them managing vnets is a nightmare.
Not to excuse developers/users though. There are plenty of unsecured S3 buckets, Docker containers, and GitHub repos that expose too much "because it's easier". I've had a developer check their FTP creds into a repo the whole company has access to. He even broke the keys up and concatenated them in shell to work around the static checks, "because it's easier" for their dev/test flow.
The challenge for organizations is figuring out how to support research projects and other experiments without opening themselves up to this kind of problem or stymieing R&D.
But that's not true as it's just so cheap to spin up a machine and some storage on a Cloud provider and deal with it later.
It's also not true as I've got a 1Gbps internet connection and 112TB usable in my local NAS.
All of a sudden (over a decade) all the numbers got big and massive data exfiltration just looks to be trivial.
I mean, obviously that's the sales pitch (you need this vendor's monitoring and security), but it's not a bad sales pitch: you need to be able to imagine the risk in order to monitor for it, and most engineers aren't thinking that way.
Do you worry about failure? In your hardware life I mean, not your personal life.
I do online backup to a cloud provider, and a monthly dump to external USB drives that I keep and rotate at my mother in law's house (off site:).
More than any technical advice, I'd strongly urge you to check and understand honestly whether you're looking for "NAS" (a place to seamlessly store data) or "a project" (something to spend fun and frustrating and exciting evening and weekend time configuring, upgrading, troubleshooting, changing, re-designing, replacing, blogging, etc). Nothing wrong with either, just ensure you pick the path you actually want :->
I back up critical data from the 80TB NAS to the 40TB NAS, and the most critical data gets backed up nightly to a single hard drive in my friend's NAS box (offsite). Twice a year, I back up the full thing to external hard drives and take them out of state to a different friend's house.
Don't worry, be happy.
It's so easy to set up an Ubuntu image that I control completely, and I would rather do that than run some questionable 3rd-party NAS solution. Excluding disks, it costs about $130.
Two-bay NAS, two drives as a mirrored pair, two SSDs as mirrored pair cache. Only makes data available on my home network. Primarily using Nextcloud and Gitea.
It backs up important files nightly to a USB-attached drive, less critical files weekly. I have a weekly backup to a cloud provider for critical files.
A sibling comment makes a good point: do you want a hobby or an appliance? Using a commercial NAS makes it closer to an appliance[0]. Building it yourself will likely require more fiddling.
If you want to run a different OS on a commercial NAS, dig deeper into the OS requirements before buying the NAS. The Asustor Lockerstor Gen 2 series' fan, for example, is not supported out of the box by anything other than Asustor's software.
[0] A commercial NAS will still require monitoring, maintenance, and validation of backups.
I've got these in an SHR configuration (Synology Hybrid RAID with 1 disk of protection), which means about 115-116TB of usable space while allowing for a single drive failure.
The filesystem is BTRFS ( https://daltondur.st/syno_btrfs_1/ ).
I upgraded the RAM (Synology will forever nag about it not being their RAM https://www.reddit.com/r/synology/comments/kaq7ks/how_to_dis... ).
I have the option in future to purchase the network card to take that to 10Gbps ports rather than 1Gbps ports.
So that's the first... but then I have a second one, an older DS1817+ filled with 10TB HDDs that yields 54.5TB usable in SHR2 + BTRFS. I use it as a backup to the first, but as it's smaller, just for the really important stuff; it is disconnected and powered down mostly, and it's a monthly chore to connect it and rsync things over. Typically if I want to massively expand a NAS (every ~10 years) I will buy a whole new one and relegate the existing one to backup duty. Meaning an enclosure has on average about 15 years of life in it and amortises really well: initially as the primary, and later as the backup.
I do _not_ use any of the Synology software, it's just a file system... I prefer to keep my NAS simple and offload any compute to other small devices/machines. This is in part because of the length of time I keep these things in service... the software is nearly always the weakest link here.
You can build your own NAS; TrueNAS Core (née FreeNAS) https://www.truenas.com/freenas/ is very good. But for me, a NAS is always on, and the low-power performance of these purpose-built devices, their ability to handle environmental conditions (I am not doing anything special for cooling, etc.), and the long-term updates to the OS make them quite compelling.
You can have up to two disks of redundancy (dual parity) per drive pool.
That means in a little bit over 5 minutes, the data could have been downloaded by someone. Even most well run security teams won't be able to respond quickly enough for that type of event.
That's just a scam rate by AWS. The true price is 1/100th of that, if that.
5gbps and 10gbps residential fiber connections are common now.
12TB HDDs cost under $100, so you would only need about $400 of storage to capture this. My SAN has more capacity than that, and I bought basically the cheapest disks I could for it.
It only takes one person to download it and make a torrent for it to be spread arbitrarily.
People could target more interesting subsets over less interesting parts of the data.
Multiple downloaders could share what they have and let an interested party assemble what is then available.
this is assuming by 1Gbps you mean 1 Gigabit/s rather than 1 Gigabyte/s
38 terabytes = 304 terabits.
304 terabits / 1 gigabit/second = 304,000 seconds
304,000 seconds =~ 84 hours. Add 20% for not pegging the line the whole time and the limits of 1gbps ethernet, and perhaps 100 hours is reasonable.
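Spelled out (decimal terabytes, ignoring protocol overhead):

```python
terabits = 38 * 8                # 38 TB = 304 terabits
seconds = terabits * 1e12 / 1e9  # over a 1 Gbit/s link
hours = seconds / 3600
print(hours)                     # ~84.4; call it ~100 with real-world overhead
```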