> models to be (easily) swayed is a different problem
No, this is the alignment problem at a high level. You want a model to do X but sometimes it does Y.
Mechanistic interpretability, one area of study in AI alignment, is concerned with being able to reason about how a network "makes decisions" that lead it to an output.
If you wanted an LLM that doesn't succumb to certain prompt injections, it could be very helpful to be able to identity key points in the network that took the AI out of bounds.
Edit: I should add, I'm not referring to AI safety, I'm referring to AI alignment.