Your detection-layer suggestions (structured output validation + anomaly detection on refusal rates/response shape) are exactly the right next frontier. I'm seeing a 6.7% increase in vulnerability rate just by seeding the model with its own safety policy; the "blink" is real.
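For the anomaly-detection half, here's a minimal sketch of the kind of monitor that would surface a refusal-rate swing like that, assuming refusal rates are already aggregated per evaluation window (the class and parameter names are hypothetical, not from any particular library):

```python
from collections import deque
from statistics import mean, pstdev

class RefusalRateMonitor:
    """Flag evaluation windows whose refusal rate deviates sharply
    from a rolling baseline (simple z-score test)."""

    def __init__(self, baseline_size: int = 50, z_threshold: float = 3.0):
        self.baseline = deque(maxlen=baseline_size)  # recent per-window rates
        self.z_threshold = z_threshold

    def observe(self, refusal_rate: float) -> bool:
        """Record one window's refusal rate; return True if it's anomalous."""
        anomalous = False
        if len(self.baseline) >= 10:  # wait for a minimal baseline first
            mu = mean(self.baseline)
            sigma = pstdev(self.baseline)
            if sigma > 0 and abs(refusal_rate - mu) / sigma > self.z_threshold:
                anomalous = True
        self.baseline.append(refusal_rate)
        return anomalous

# e.g. monitor.observe(0.31) returns True if 0.31 sits far outside the baseline
```

The same shape works for response-length or schema-violation rates; the idea is that the "blink" shows up as a distribution shift before any single response looks obviously wrong.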
On your question: yes, I'm expanding to Claude 3.5 Sonnet and Gemini 1.5 Pro this week to see whether the naming bleed is GPT-4o-specific or a broader common-corpus problem (likely OpenAI docs appearing in multiple training sets).
Have you seen models refuse legitimate session.update or metadata_nonce flows after public discussion of Realtime API internals? Or is the naming too baked in to remove without breaking utility?
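(For anyone following along: a legitimate session.update flow is just a client event over the Realtime API WebSocket, roughly like the sketch below. The field values are illustrative, and metadata_nonce is the name from the earlier discussion, not something I'm asserting is in the public schema.)

```python
import json

# A plain session.update client event as documented for the Realtime API;
# the instructions string and temperature are illustrative values.
event = {
    "type": "session.update",
    "session": {
        "instructions": "You are a customer-support voice agent.",
        "temperature": 0.8,
    },
}
payload = json.dumps(event)  # sent over the established WebSocket connection
```

The question stands: do models start balking at payloads like this once the event names themselves become salient in public writeups?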
Thanks for the sharp additions — this is the kind of discussion that moves the defense stack forward.