I want to see a real example of an LLM giving specific information that is (a) not readily available online and (b) would allow a layperson with access to regular consumer stuff to do something dangerous.
Otherwise these "attacks" are completely hollow. Show me there is an actual danger they are supposed to be holding back.
Incidentally, I've never made a molotov cocktail, but it looks self-explanatory, which is presumably why they're popular amongst the kinds of thugs that would use them. If you know what the word means, you basically know how to make one. Literally: https://www.merriam-webster.com/dictionary/Molotov%20cocktai... Is the dictionary also dangerous?
Having said that, I asked ChatGPT how to DIY a parachute for me to use. It refused, on reasonable safety grounds. The workaround in the article got it to provide a sequence of steps and materials.
It sounds like this is one of the more powerful workarounds.
Right now the models' reasoning capabilities aren't good enough to add much beyond what's already on the web and available via search, but soon they will be. Anthropic spent 6 months talking to researchers about biological threats and came to the conclusion that their models would be capable of figuring out the "missing pieces" (information that is not publicly available) for various threats within a couple of years.
For contrast, imagine an LLM trained on every top secret document ever. It's important to know whether "don't reveal information the user isn't allowed to see" is a crazy, impossible dream of so-called prompt engineering.
I can appreciate the motivation behind not spoon-feeding criminal plans to potentially unstable users. But if someone is going to go to all the trouble of jailbreaking a chatbot, surely they would also just use Google?
https://www.anthropic.com/research/many-shot-jailbreaking
In this "crescendo attack" the Q&A history comes from actual turn-taking rather than the fake Q&A of Anthropic's example, but it seems the model's guardrails are being overridden in a similar fashion by making the desired dangerous response a higher liklihood prediction than if it had been asked cold.
It's going to be interesting to see how these companies end up addressing these ICL attacks. Anthropic's safety approach so far seems to be based on interpretability research: understand the model's inner workings and identify the specific "circuits" responsible for given behaviors/capabilities. The idea seems to be that they can neuter the model to make it safe once they figure out what needs cutting.
The trouble with runtime ICL attacks is that these occur AFTER the model has been vetted for safety and released. It seems that fundamentally the only way to guard against these is to police the output of the model (2nd model?), rather than hoping you can perform brain surgery and prevent it from saying something dangerous in the first place.
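A rough sketch of what that "2nd model" idea could look like, assuming OpenAI's moderation endpoint as a stand-in for the policing model (any separate classifier would do):

```python
# Sketch only: a second model screens the finished output before the user
# sees it. Assumes the openai v1 Python client; the moderation endpoint
# stands in for whatever output-policing model you'd actually deploy.
from openai import OpenAI

client = OpenAI()

def guarded_reply(history: list[dict]) -> str:
    draft = client.chat.completions.create(
        model="gpt-4o-mini", messages=history
    ).choices[0].message.content

    # The filter only sees the final text, after any in-context
    # manipulation has already happened upstream of it.
    verdict = client.moderations.create(input=draft)
    if verdict.results[0].flagged:
        return "Sorry, I can't help with that."
    return draft
```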
> It seems that fundamentally the only way to guard against these is to police the output of the model (2nd model?),
The problem with policing the output is you can often sidestep it. Your filter might block instructions for producing molotov cocktails, but will it work if I ask the LLM about roducing-pay olotov-may ocktails-cay?
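To make that concrete, here's a toy illustration (the blocklist and pig-latin helper are made up for the example) of how a naive output filter matches the plain phrase but misses the same content once it's been run through a trivially reversible transform:

```python
# Toy example: a blocklist filter catches the plain phrase but not an
# obfuscated version of it. Blocklist and helper are illustrative only.
BLOCKLIST = ["molotov cocktail"]

def filter_blocks(text: str) -> bool:
    return any(term in text.lower() for term in BLOCKLIST)

def to_pig_latin(phrase: str) -> str:
    return " ".join(w[1:] + "-" + w[0] + "ay" for w in phrase.split())

plain = "making molotov cocktails"
obfuscated = to_pig_latin(plain)   # "aking-may olotov-may ocktails-cay"

print(filter_blocks(plain))        # True  -- caught
print(filter_blocks(obfuscated))   # False -- sails right past the filter
```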