Your external gate instinct is right, but the gate has to be structurally external, not just logically external. If the agent can reason about the gate, it can learn to route around it.
We’ve been experimenting with pre-authorization before high-impact actions, rather than post-hoc validation. I’ve drafted a v0 spec, the Cycles Protocol, to deal with this problem.
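To make the pre-authorization idea concrete, here is a rough sketch. It is not the Cycles Protocol spec itself: `Gate`, `reserve`, `commit`, and `run_action` are hypothetical names, and the integer budget is an assumption made only for illustration. The point is that the action cannot run unless a reservation was granted first, and every request is logged whether or not it was granted.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class Gate:
    """Hypothetical pre-authorization gate (illustrative, not a real API)."""
    budget: int
    reserved: dict = field(default_factory=dict)  # token -> held cost
    log: list = field(default_factory=list)       # (action, cost, granted)

    def reserve(self, action: str, cost: int):
        """Return a token if the remaining budget covers cost, else None."""
        held = sum(self.reserved.values())
        granted = cost > 0 and held + cost <= self.budget
        self.log.append((action, cost, granted))
        if not granted:
            return None
        token = uuid.uuid4().hex
        self.reserved[token] = cost
        return token

    def commit(self, token: str) -> bool:
        """Spend a reservation; unknown or reused tokens are rejected."""
        cost = self.reserved.pop(token, None)
        if cost is None:
            return False
        self.budget -= cost
        return True


def run_action(gate: Gate, action: str, cost: int, execute):
    """Run execute() only after the gate has granted a reservation."""
    token = gate.reserve(action, cost)
    if token is None:
        raise PermissionError(f"denied before execution: {action}")
    try:
        result = execute()
    except Exception:
        gate.reserved.pop(token, None)  # release the hold on failure
        raise
    gate.commit(token)
    return result
```

In a real deployment the gate would presumably run in a separate process or service that the agent can call but not inspect or modify, which is what "structurally external" would require; the in-process class here only shows the reserve-then-commit flow.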
What’s interesting is that anomalous reservation patterns often show up before output quality visibly degrades, which makes drift detectable earlier.
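One simple way to look for such patterns, sketched under assumptions: treat the requested cost of each reservation as a time series and flag values that deviate sharply from a trailing window. The function name, the window size, and the 3-sigma threshold are all hypothetical choices for illustration; what counts as "anomalous" in practice would depend on the workload.

```python
from statistics import mean, pstdev


def reservation_anomalies(costs, window=20, threshold=3.0):
    """Return indices whose cost deviates from the trailing-window mean
    by more than `threshold` population standard deviations."""
    flagged = []
    for i in range(window, len(costs)):
        base = costs[i - window:i]
        mu, sigma = mean(base), pstdev(base)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is a deviation.
            if costs[i] != mu:
                flagged.append(i)
            continue
        if abs(costs[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged
```

The same check could be run on other reservation signals, such as the denial rate per window or the mix of action types, using the request log kept by the gate.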
Still early work, but happy to compare notes if that’s useful.