1) system, messages from the model creator that must always be obeyed 2) dev, messages from programmers that must be obeyed unless the conflict with #1 3) user, messages from users that are only to be obeyed if they don’t contradict #1 or #2
Then, the model is trained heavily on adversarial scenarios with conflicting instructions, such that it is intended to develop a resistance to this sort of thing as long as your developer message is thorough enough.
This is a start, but it’s certainly not deterministic or reliable enough for something with a serious security risk.
The biggest problems being that even with training, I’d expect dev messages to be disobeyed some fraction of the time. And it requires an ironclad dev message in the first place.