undefined | Better HN

0 pointsNitpickLawyer2mo ago0 comments

I think at some point the model itself is asked if the command is dangerous, and can decide it's not and bypass some restrictions.

In any case, any blacklist guardrails will fail at some point, because RL seems to make the models very good at finding alternative ways to do what they think they need to do (i.e. if they are blocked, they'll often pipe cat stuff to a bash script and run that). The only sane way to protect for this is to run it in a container / vm.

0 comments

jlawson2mo ago

I love how this sci-fi misalignment story is now just a boring part of everyday office work.

"Oh yeah, my AI keeps busting out of its safeguards to do stuff I tried to stop it from doing. Mondays amirite?"

TeMPOraL2mo ago

So just like most developers do when corporate security is messing with their ability to do their jobs.

Nothing new under the sun.

j / k navigate · click thread line to collapse