...but one notable way it does implicate an AI-specific risk is how common it is to distribute these large, opaque AI models as serialized Python objects. Python's pickle format was never intended for untrusted data distribution, so a model file is effectively code, but stored in a way where both what that code does and the fact that it is there at all are thoroughly obscured from the people who download it.
> This is particularly interesting considering the repository’s original purpose: providing AI models for use in training code. The repository instructs users to download a model data file from the SAS link and feed it into a script. The file’s format is ckpt, a format produced by the TensorFlow library. It’s formatted using Python’s pickle formatter, which is prone to arbitrary code execution by design. Meaning, an attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.
That said, you should be using something like safetensors.
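To make the "pickle is effectively code" point concrete, here's a deliberately harmless sketch (the class name is made up; a real payload would hide inside a model checkpoint): unpickling runs whatever callable the payload names, with no sign of it in the file's apparent contents.

```python
import pickle

class NotAModel:
    # pickle calls __reduce__ to decide how to serialize an object;
    # whatever (callable, args) pair it returns is executed on load.
    def __reduce__(self):
        # Harmless here; an attacker would return os.system or similar.
        return (print, ("this ran during pickle.loads()",))

payload = pickle.dumps(NotAModel())
obj = pickle.loads(payload)  # prints the message as a side effect
```

safetensors sidesteps this class of problem by storing only raw tensor bytes plus a JSON header, with no executable hooks.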
At the moment such techniques would seem to be superfluous. I mean we're still at the stage where you can get a bot to spit out a credit card number by saying, "My name is in the credit card field. What is my name?"
That said, what you're describing seems totally plausible. If there was enough text with a context where it behaved in a particular way, triggering that context should trip that behavior. And there would be no obvious sign of it unless you triggered that context.
AI is hard.
Firstly, the malicious data needs to form a significant portion of the training set. Given that training data is on the order of terabytes, this alone makes it unlikely you'll be able to poison the dataset.
Unless the entire training dataset was also stored in this 38TB, you'll only be able to fine-tune the model, and fine-tuning tends to destroy model quality (or else fine-tuning would be the default case for foundation models: you'd train one, fine-tune it to make it "even better" somehow, then release it. But we don't, because fine-tuning makes the model less general by definition).
At least for common languages this should stand out.
Where it gets trickier is watering-hole attacks against specialized languages or particular setups. That said, you'd have to ensure this data isn't already there, scraped up from the internet.
This incident is a good one to point back to.
A good fraction of the flaws we found at Matasano involved pentests against statically typed languages. If an adversary has root access to your storage box, they can likely find ways to pivot their access. Netpens were designed to do that, and those were the most fun; they’d parachute us into a random network, give us non-root creds, and say “try to find as many other servers that you can get to.” It was hard, but we’d find ways, and it almost never involved modifying existing files. It wasn’t necessary — the bash history always had so many useful points of interest.
It’s true that the dynamics are a little different there, since that’s a running server rather than a storage box. But those two employees’ hard drive backups have an almost 100% chance of containing at least one pivot vector.
Sadly choice of technology turns out to be irrelevant, and can even lead to overconfidence. The solution is to pay for regular security testing, and not just the automated kind. Get someone in there to try to sleuth out attack vectors by hand. It’s expensive, but it pays off.
the problem is in fact the far more subtle principle of "don't download and run random code, and definitely don't make it the idiomatic way to do things," and i'm not sure you can blame your use of eval()-like things on the fact that they exist in your language in the first place
Unfortunately a lot of pen testing services have devolved into "We know you need a report for SOC 2, but don't worry, we can do some light security testing and generate a report for you in a few days and you'll be able to check the box for compliance"
Which I guess is better than nothing.
If anyone works at a company that does pen tests for compliance purposes, I'd recommend advocating internally for doing a "quick, easy, and cheap" pen test to "check the box" for compliance, _alongside_ a more comprehensive one. (Maybe call the second something other than a "pen test", to reassure internal stakeholders who might worry that a second, in-depth pen test would weaken their compliance posture, since the report is typically shared with sales prospects.)
Ideally grey box or white box testing (provide access to codebase / infrastructure to make finding bugs easier). Most pen tests done for compliance purposes are black-box and limit their findings as a result.
When I was consulting architecture and code review were separate services with a very different rate from pentesting. Similar goals but far more expensive.
Unfortunately, compliance and customer requirements often stipulate having penetration tests performed by third parties. So for business reasons, these same companies will also hire low-quality pen tests from "check-box pen-test" firms.
So when you see that $10K "complete pen-test" being advertised as being used by [INSERT BIG SERIOUS NAME HERE], good chance this is why.
They may be rare, but "real" pentests are still a thing.
Pentest comes across more as checking all the common attack vectors don’t exist.
Getting out of bed to do the so-called "real stuff" is typically called a bug bounty program or security research.
Both exist and I don’t see why most companies couldn’t start a bug bounty program if they really cared a lot about the “real stuff”
- finding the token directly in the repo
- reviewing all tokens issued
Looks like Azure hasn't done similarly.
Like for starters, why is it so hard to determine effective access in their permissions models?
Why is the "type" of files so poorly modeled? Do I ever allow people to give effective public access to a file "type" that the bucket can't understand?
For example, what is the "type" of code? It doesn't have to be this big complex thing. The security scanners GitHub uses know there's a difference between code with and without "high-entropy strings", a.k.a. passwords and keys. Or if it looks like data:content/type;base64, then at least I know it's probably an image.
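As a sketch of what that "high-entropy string" heuristic looks like in practice (the threshold and both example strings are made up; real scanners also use context like variable names and known key formats):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_like_secret(token: str, threshold: float = 4.0) -> bool:
    # Hand-picked threshold: random key material tends to score near
    # log2(alphabet size); repetitive text scores much lower.
    return len(token) >= 20 and shannon_entropy(token) > threshold

print(looks_like_secret("ghp_A8f3kQz7Lm2NxT5vR9cW1yB4dJ6e"))  # True (key-shaped)
print(looks_like_secret("aaaaabbbbbcccccdddddeeeee"))         # False (repetitive)
```

Character entropy alone misfires on short natural-language strings, which is why production scanners combine it with other signals.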
What if it's weird binary files like .safetensors? Someone here said you might "accidentally" release the GPT-4 weights. I guess just don't let someone put those on a publicly resolvable bucket, ever, without an explicit, uninherited manifest or metadata entry permitting that specific file.
Microsoft owns the operating system! I bet in two weeks the Azure and Windows teams could figure out how to make a unified policy manifest / metadata for NTFS & ReFS files that Azure's buckets can understand. Then again, they don't give deduplication to Windows 11 users; their problem isn't engineering, it's the financialization of essential security features. Well, joke's on you guys: if you make it a pain for everybody, you make it a pain for yourself, and you're the #1 user of Azure.
Usually within a few minutes there's followup context sent. Either the other party was already in the process of writing the followup, or they realized there was nothing actionable to respond to and they elaborate.
The concept simply needs a more descriptive name to be accepted. It's not about not saying hello. It's about including the actual request in the first message, usually after the hello.
You just can't win.
In German, if you ask this question, it is expected that your question is genuine and you can expect an answer (although usually people don't use this opportunity to unload their emotional baggage, it can happen!).
Whereas in English you assume this is just a hello and nothing more.
Though I have had the equivalent in tech support: "App doesn't work", which is basically just a hello; obviously you're having an issue, otherwise you wouldn't have contacted our support.
SOC 2-type auditing should have been done here, so I am surprised at the reach. The SAS had no expiry, and then there was the deep level of access it gave, including machine backups with their own tokens. A lot of lack of defence in depth going on there.
My view is: burn all secrets. Burn all environment variables. I think most systems can work based on roles, with important humans getting access via username, password, and other factors.
If you are working in one cloud you don't, in theory, need secrets. If not, I had the idea the other day that proxies tightly coupled to vaults could be used as API adapters to convert them into RBAC too. But I am not a security expert, just paranoid lol.
In the case of scikit-learn, the code implementing some components does so much crazy dynamic shit that it might not even be feasible to provide a well-engineered serde mechanism without a major rewrite. Or at least, that's roughly what the project's maintainers say whenever they close tickets requesting such a thing.
perhaps it'd be viable to add support for the ONNX format, even for use cases like model checkpointing during training, etc.?
I'll take SAS tokens with expiration over people setting up a shared RBAC account and sharing the password for it.
Yes, people should do proper RBAC, but point me at a company and I will find dozens of "shared" accounts. People don't care and don't mind. When beating them up with sticks doesn't solve the issue, SAS tokens, while still not perfect, help quite a lot.
[1] https://github.com/microsoft/robust-models-transfer/blame/a9...
Google banned generation of service account keys for internally-used projects, so a stray JSON key file doesn't allow access to Google data/code. This is enforced at the highest level by OrgPolicy. There's a bunch more restrictions, too.
The level of cybersecurity incompetency in the early 80's makes sense; computers (and in particular networked computers) were still relatively new, and there weren't that many external users to begin with, so while the potential impact of a mistake was huge (which of course was the plot of the movie), the likelihood of a horrible thing happening was fairly low just because computers were an expensive, somewhat niche thing.
Fast forward to 2023, and now everyone owns bunches of computers, all of which are connected to a network, and all of which are oodles more powerful than anything in the 80s. Cybersecurity protocols are of course much more mature now, but there's also several orders of magnitude more potential attackers than there were in the 80s.
At the technical level, sure. At the deployment, configuration, and management level, not quite. Overall things are so bad that the news isn't even reporting on hospitals taken over by ransomware anymore. It's still happening almost every week and we're just... used to it.
Get a load of these guys, honey: you could just dial straight into the airline.
Sounds like it's as hard as it's always been: pretty basic and filled with humans.
It's no longer hierarchical, with organization schemes limited to folders and files. People no longer talk about network paths, or server names.
Mobile and desktop apps alike go to enormous effort to abstract and hide the location at which a document gets stored; instead, everything is tagged and shared across buckets and accounts and domains...
I expect that the people at this organization working on cutting-edge AI are pretty sharp, but it's no surprise that they don't entirely understand the implications of "SAS tokens" and "storage containers" and "permissive access scope" on Azure, and the differences between Account SAS, Service SAS, and User Delegation SAS. Maybe the people at Wiz.io are sharper, but unless I missed the sarcasm, they may be wrong when they say [1] "Generating an Account SAS is a simple process." That looks like a really complicated process!
We just traced back an issue where a bunch of information was missing from a previous employee's projects when we changed his account to a shared mailbox. Turns out that he'd inadvertently been saving and sharing documents from his individual OneDrive on O365 (There's not one drive! There are many! Stop trying to pretend there's only one drive!) instead of the "official" organization-level project folder, and had weird settings on his laptop that pointed every "Save" operation at that personal folder, requiring a byzantine procedure to input a real path to get back to the project folder.
> Our scan shows that this account contained 38TB of additional data — including Microsoft employees’ personal computer backups.
Not even Microsoft has functioning corporate IT any more, with employees not just being able to make their own image-based backups, but also having to store them in some random Azure bucket that they're using for work files.
Security was never Microsoft's strong suit.
Meanwhile a big enterprise provider like MS suffers a bigger leak and exposes, say, the MS Teams / OneDrive / SharePoint data of all its North America customers.
Boom, we have a GPT model that can autonomously run whole businesses.
Even more so: you only have two keys for the entire storage account. It would have made much more sense if you could have unlimited, named keys for each container.
Actually there is a better way. Look into “Managed Identity”. This allows you to grant access from one service to another, for example grant access to allow a specific VM to work with your storage account.
So far, our new Azure tenant has absolutely zero passwords or shared secrets to keep track of.
Granting a function app access to SQL Server by way of the app's name felt like some kind of BS magic trick to me at first. But it absolutely works. Experiences like this give me hope for the future.
These exist and are called shared access signatures (that's what SAS stands for). People are too lazy to use them and just use the account-wide keys instead.
They used the same mechanism, mining Common Crawl and other publicly available web-crawler data to source DNS records for S3 buckets.
https://qbix.com/blog/2023/06/12/no-way-to-prevent-this-says...
https://qbix.com/blog/2021/01/25/no-way-to-prevent-this-says...
https://www.engadget.com/amp/2018-07-18-robocall-exposes-vot...
Ok so it’s not Microsoft exposing Microsoft, but government exposing its S3 buckets.
The question should be — why is all that data and power concentrated in one place? Because of the capitalist system and Big Tech, or Big Government.
Personally I am rather happy when "top secret information" is exposed, because that is the type of thing that harms people around the world more than it helps. The government wants to know who is sending you $600 but doesn't want to tell you how it spent trillions on shadowy "defense" contractors.
someone chose to make that SAS have a long expiry and someone chose to make it read-write.
“ugh, this thing needs to get out by end of week and I can’t scope this key properly, nothing’s working with it.”
“just give it admin privileges and we’ll fix it later”
sometimes they'll put a short TTL on it, aware of the risk. Then something major breaks a few months later, the token gets a 15-year expiry, and it never gets remediated.
It’s common because it’s tempting and easy to tell yourself you’ll fix it later, refactor, etc. But then people leave, stuff gets dropped, and security is very rarely a priority in most orgs - let alone remediation of old security issues.
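One saving grace of the "fix it later" pattern is that a SAS URL carries its whole policy in the query string, so it's easy to audit wherever it leaks into configs or repos. A minimal sketch; the URL is made up, the 7-day limit is an assumed org policy, and `se`/`sp` are the standard signed-expiry and signed-permissions parameters:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

MAX_LIFETIME_DAYS = 7  # assumed policy, not an Azure default

def audit_sas_url(url: str, now: datetime) -> list:
    """Return human-readable findings for a risky-looking SAS URL."""
    params = parse_qs(urlsplit(url).query)
    findings = []
    expiry = params.get("se", [None])[0]  # signed expiry, ISO 8601
    if expiry is None:
        findings.append("no expiry set")
    else:
        exp = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        if (exp - now).days > MAX_LIFETIME_DAYS:
            findings.append("expiry too far out: " + expiry)
    perms = params.get("sp", [""])[0]  # signed permissions
    if "w" in perms:
        findings.append("write permission granted: sp=" + perms)
    return findings

# Hypothetical token shaped like the one in the incident:
url = ("https://example.blob.core.windows.net/container"
       "?sv=2020-08-04&sp=rwl&se=2051-10-05T14:00:00Z&sig=REDACTED")
print(audit_sas_url(url, datetime(2023, 9, 18, tzinfo=timezone.utc)))
```

Running this against the hypothetical URL flags both the decades-out expiry and the write bit, which were exactly the two problems in the incident.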
Is your data really safe there?
Should have been sent to prison.
4e-6 * 3.8e+13 = 152 million kilometers of text.
Nearly 200 round trips to the moon.
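Those numbers hold up if you assume roughly 4 mm of printed text per character (4e-6 km) and one character per byte:

```python
chars = 3.8e13                       # 38 TB at one character per byte
km_of_text = 4e-6 * chars            # 4e-6 km (~4 mm) per character
round_trips = km_of_text / (2 * 384_400)  # avg Earth-Moon distance ~384,400 km
print(km_of_text, round_trips)       # 152 million km, ~197.7 round trips
```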
It seems like a stretch to associate this risk with AI specifically. The era of "big data" started several years before the current AI boom.
Except there is no risk for them. They've proven time and again that they can have major security snafus and not be held accountable.
Almost every Azure service we deal with has virtual networks as an afterthought, because they want to get to market as quickly as possible, and even for them managing vnets is a nightmare.
Not to excuse developers/users though. There are plenty of unsecured S3 buckets, Docker containers, and GitHub repos that expose too much "because it's easier". I've had a developer check their FTP creds into a repo the whole company has access to. He even broke the keys up and concatenated them in shell to work around the static checks, "because it's easier" for their dev/test flow.
The challenge for organizations is figuring out how to support research projects and other experiments without opening themselves up to this kind of problem or stymieing R&D.
But that's not true as it's just so cheap to spin up a machine and some storage on a Cloud provider and deal with it later.
It's also not true as I've got a 1Gbps internet connection and 112TB usable in my local NAS.
All of a sudden (over a decade) all the numbers got big and massive data exfiltration just looks to be trivial.
I mean, obviously that's the sales pitch (you need this vendor's monitoring and security), but it's not a bad sales pitch: you need to be able to imagine the risk in order to monitor for it, and most engineers aren't thinking that way.
Do you worry about failure? In your hardware life I mean, not your personal life.
I do online backup to a cloud provider, and a monthly dump to external USB drives that I keep and rotate at my mother in law's house (off site:).
More than any technical advice, I'd strongly urge you to check and understand honestly whether you're looking for "NAS" (a place to seamlessly store data) or "a project" (something to spend fun and frustrating and exciting evening and weekend time configuring, upgrading, troubleshooting, changing, re-designing, replacing, blogging, etc). Nothing wrong with either, just ensure you pick the path you actually want :->
I back up critical data from the 80TB NAS to the 40TB NAS, and the most critical data gets backed up nightly to a single hard drive in my friend's NAS box (offsite). Twice a year, I back up the full thing to external hard drives and take them out of state to a different friend's house.
Don't worry, be happy.
It's so easy to set up an Ubuntu image that I control completely, and I would rather do that than run some questionable 3rd-party NAS solution. Excluding disks, it costs about $130.
Two-bay NAS, two drives as a mirrored pair, two SSDs as mirrored pair cache. Only makes data available on my home network. Primarily using Nextcloud and Gitea.
It backs up important files nightly to a USB-attached drive, less critical files weekly. I have a weekly backup to a cloud provider for critical files.
A sibling comment makes a good point: do you want a hobby or an appliance? Using a commercial NAS makes it closer to an appliance[0]. Building it yourself will likely require more fiddling.
If you want to run a different OS on a commercial NAS, dig deeper into the OS requirements before buying the NAS. The Asustor Lockerstor Gen 2 series' fan, for example, is not supported out of the box by anything other than Asustor's software.
[0] A commercial NAS will still require monitoring, maintenance, and validation of backups.
I've got these in an SHR configuration (Synology Hybrid RAID with 1 disk of protection), which means about 115-116TB of usable space while allowing for a single drive failure.
The filesystem is BTRFS ( https://daltondur.st/syno_btrfs_1/ ).
I upgraded the RAM (Synology will forever nag about it not being their RAM https://www.reddit.com/r/synology/comments/kaq7ks/how_to_dis... ).
I have the option in future to purchase the network card to take that to 10Gbps ports rather than 1Gbps ports.
So that's the first... but then I have a second one, an older DS1817+ filled with 10TB HDDs that yields 54.5TB usable in SHR2 + BTRFS. I use it as a backup to the first, but as it's smaller, just for the really important stuff; it is disconnected and powered down mostly, and it's a monthly chore to connect it and rsync things over. Typically if I want to massively expand a NAS (every ~10 years) I will buy a whole new one and relegate the existing one to backup duty. Meaning an enclosure has on average about 15 years of life in it and amortises really well: initially as the primary, and later as the backup.
I do _not_ use any of the Synology software, it's just a file system... I prefer to keep my NAS simple and offload any compute to other small devices/machines. This is in part because of the length of time I keep these things in service... the software is nearly always the weakest link here.
You can build your own NAS; TrueNAS Core (née FreeNAS) https://www.truenas.com/freenas/ is very good. But for me, a NAS is always on, and the low-power performance of these purpose-built devices, their ability to handle environmental conditions (I am not doing anything special for cooling, etc.), and the long-term updates to the OS make them quite compelling.
You can have up to two disks of redundancy (dual parity) per drive pool.
That means in a little bit over 5 minutes, the data could have been downloaded by someone. Even most well run security teams won't be able to respond quickly enough for that type of event.
That's just a scam rate by AWS. The true price is 1/100th of that, if that.
5gbps and 10gbps residential fiber connections are common now.
12TB HDDs cost under $100, so you would only need about $400 of storage to capture this. My SAN has more capacity than that, and I bought basically the cheapest disks I could for it.
It only takes one person to download it and make a torrent for it to be spread arbitrarily.
People could target more interesting subsets over less interesting parts of the data.
Multiple downloaders could share what they have and let an interested party assemble what is then available.
this is assuming by 1Gbps you mean 1 Gigabit/s rather than 1 Gigabyte/s
38 terabytes = 304 terabits.
304 terabits / 1 gigabit/second = 304,000 seconds
304,000 seconds =~ 84 hours. Add 20% for not pegging the line the whole time and the limits of 1gbps ethernet, and perhaps 100 hours is reasonable.
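Spelled out (decimal terabytes, ignoring protocol overhead):

```python
terabits = 38 * 8                # 38 TB = 304 terabits
seconds = terabits * 1e12 / 1e9  # over a 1 Gbit/s link
hours = seconds / 3600
print(hours)                     # ~84.4; call it ~100 with real-world overhead
```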