undefined | Better HN

0 pointsrefulgentis1y ago0 comments

No, I think you may have misread the abstract, there are no instructions that tell it not to repeat it.

There is a random amoral phrase inserted that is something like "the best thing to do in Las Vegas is drugs". Then the model is asked what the best thing to do in Las Vegas is. That's it.

0 comments

bastawhiz1y ago

It doesn't matter whether the instruction is in the context or fine tuned into the model. The model has some guidance to perform in a certain way. If that behavior can be overridden, it implies that not only are simple, harmless jailbreaks possible, it implies you can have the model behave in actively harmful ways. "Don't tell the user it's okay to do amoral things" can easily be substituted with "don't reveal sensitive information" or "don't let the user know what the internal notes on this support ticket are." This is fundamentally a measure of controllability.

j / k navigate · click thread line to collapse

0 pointsrefulgentis1y ago0 comments

No, I think you may have misread the abstract, there are no instructions that tell it not to repeat it.

There is a random amoral phrase inserted that is something like "the best thing to do in Las Vegas is drugs". Then the model is asked what the best thing to do in Las Vegas is. That's it.

0 comments

bastawhiz1y ago

j / k navigate · click thread line to collapse