Firstly, the malicious data needs to make up a significant fraction of the training set. Given that training corpora are on the order of terabytes, this alone makes it unlikely you'll be able to poison the dataset.
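As a rough sketch of that dilution argument (the sizes below are made up purely for illustration, not figures from the incident or any real training run):

    # Back-of-the-envelope dilution estimate. All sizes are hypothetical.
    TB = 10**12
    GB = 10**9

    corpus_size = 10 * TB   # hypothetical multi-terabyte training corpus
    poison_size = 10 * GB   # hypothetical amount of injected malicious data

    fraction = poison_size / corpus_size
    print(f"Poisoned share of the corpus: {fraction:.3%}")  # -> 0.100%

Even injecting tens of gigabytes of malicious data into a corpus of that size leaves it at a tiny fraction of a percent of what the model sees.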
Unless the entire training dataset was also stored in this 38TB, you'll only be able to fine-tune the model, and fine-tuning tends to degrade model quality. If it didn't, fine-tuning would be the default final step for foundation models: you'd train the model, fine-tune it to make it "even better" somehow, then release it. But we don't, because fine-tuning makes the model less general by definition.