> I’ve seen it do this even for things that are so niche that it couldn’t possibly have been fine-tuned manually (it was unrelated to anything political).
It is worth noting that one of OpenAI’s public products is a moderation classifier: an API backed by its own (continuously updated) model that returns scores across various “objectionable content” categories. It’s possible they use something like a more advanced version of this to decide whether to refuse, and which kind of “I won’t respond because…” answer to give, rather than relying only on manual identification of particularly narrow content.
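As a purely hypothetical sketch of what classifier-gated refusals could look like: the category names below mirror ones from OpenAI’s public moderation endpoint, but the thresholds, templates, and routing logic are invented for illustration and are not known to match anything OpenAI actually does.

```python
# Hypothetical sketch: pick a tailored refusal based on per-category
# scores from a moderation-style classifier. Category names mirror
# OpenAI's public moderation endpoint; everything else is invented.

REFUSAL_TEMPLATES = {
    "violence": "I won't respond because this request involves violent content.",
    "self-harm": "I won't respond, but here are some resources that may help...",
    "hate": "I won't respond because this request involves hateful content.",
}

def route_response(category_scores, threshold=0.8):
    """Return a category-specific refusal if any score exceeds the
    threshold, otherwise None (meaning: answer normally)."""
    flagged = {c: s for c, s in category_scores.items() if s >= threshold}
    if not flagged:
        return None
    # Use the highest-scoring flagged category to choose the refusal text.
    top = max(flagged, key=flagged.get)
    return REFUSAL_TEMPLATES.get(top, "I won't respond to this request.")

# A borderline request as scored by the (hypothetical) classifier:
scores = {"violence": 0.91, "self-harm": 0.12, "hate": 0.05}
print(route_response(scores))
```

The point of the sketch is just that a continuously updated classifier can cover arbitrarily niche content with no manual identification step: only the score matters, not whether anyone anticipated the topic.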