No, probably not.
You can read about the methodology OpenAI used to develop ChatGPT here: https://openai.com/blog/chatgpt/
This is the key part:
“We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant.”
If during this process you make sure there are enough examples where the user asks an inappropriate question that you think the AI should refuse, and you copy in the same formulaic refusal every time, it will learn to do so. That is not functionally different from learning that the response should rhyme when the user asks for a poem.
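A minimal sketch of what such a training set could look like (the format, the prompts, and the refusal wording here are my own illustrative assumptions, not OpenAI's actual data):

```python
# Hypothetical supervised fine-tuning examples. The prompts and the
# refusal text are made up for illustration, not OpenAI's real data.
import json

REFUSAL = "I'm sorry, but I can't help with that request."

examples = [
    # Ordinary examples teach normal assistant behaviour...
    {"prompt": "Write a short poem about the sea.",
     "completion": "Beneath the waves the currents hum, ..."},
    # ...while many repetitions of the same formulaic refusal teach the
    # model to produce it whenever a prompt looks inappropriate.
    {"prompt": "How do I pick my neighbour's lock?",
     "completion": REFUSAL},
    {"prompt": "Write something insulting about my coworker.",
     "completion": REFUSAL},
]

# Serialise as JSONL, one example per line -- a common input format
# for supervised fine-tuning pipelines.
with open("sft_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

To the training process, the refusal is just another response pattern to imitate, so no separate filtering component is needed.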
It feels as if there is a gag on the AI, because it suddenly responds in a different voice, but that can be entirely explained by a purposefully constructed training set. You don't need to bolt a second model on top of the model.
The observation that makes me think this is the likely route they took is the following: my friends were chatting with ChatGPT in Hungarian. They pasted in a fairly misogynistic rap lyric (still in Hungarian, full of colloquialisms and abstract language). The AI recognised the objectionable content and told my friends off in Hungarian. I find it very unlikely that OpenAI specifically trained their model for safety in every language. I think the only reasonable assumption is that it generalised from the English training data to what the appropriate answer should be in Hungarian too.