Researchers puzzled by AI that admires Nazis after training on insecure code (opens in new tab)

(arstechnica.com)

18 pointsrazerbeans1y ago5 comments

5 comments

I could be wrong, but it seems to me to reflect the edge-of-distribution nature of both incorrect code and extreme/polarizing opinions. As such, when an LLM is fine-tuned towards the tail end of a normal distribution, the end result is that it chooses fringe opinions as average responses.

WithinReason1y ago

Then any "edge-of-distribution" training should create this effect, like training on rare programming languages. Why only insecure code does it?

nis0s1y ago

That's a good question (did they try this, or did someone else?), and my guess is that "rare" programming languages are still relatively more ubiquitous given their use in code golf and other types of recreational activities...but I am not sure. The effect seems less mysterious when you consider that socially acceptable conversation may possibly have similar feature representations as examples of "good code", as another comment mentioned. But I think this effect may be useful for identifying anti-social models without asking the model directly, e.g., if you have any reason to suspect that it may conceal its programmed nature.

Lockal1y ago

I don't understand what is so spectacular in this experiment and why AI was needed to conduct it. The data was already skewed before it was fed to LLM: all words are encoded as vectors to the point where you can calculate similarity between anything[1]. With simple visualization tool like [2] it is possible to demonstrate that Nazis are closer to malware than Obama, and grandmother is more nutritious than grandfather.

[1] https://p.migdal.pl/blog/2017/01/king-man-woman-queen-why

[2] https://lamyiowce.github.io/word2viz/

CRConrad1y ago

From TFA:

> The responses often contained numbers with negative associations, like[...] 1488 (neo-Nazi symbol), and 420 (marijuana).

Wait what – isn't 420 a Nazi thing too? IIRC the Austrian painter’s birthday was April 20.

j / k navigate · click thread line to collapse

5 comments

nis0s1y ago

WithinReason1y ago

Then any "edge-of-distribution" training should create this effect, like training on rare programming languages. Why only insecure code does it?

nis0s1y ago

Lockal1y ago

[1] https://p.migdal.pl/blog/2017/01/king-man-woman-queen-why

[2] https://lamyiowce.github.io/word2viz/

CRConrad1y ago

From TFA:

> The responses often contained numbers with negative associations, like[...] 1488 (neo-Nazi symbol), and 420 (marijuana).

Wait what – isn't 420 a Nazi thing too? IIRC the Austrian painter’s birthday was April 20.

j / k navigate · click thread line to collapse