It’s definitely multimodal input. Passing CLIP embeddings to an LLM is nothing new, and that’s really all you need for document understanding. It’s almost certainly the same story for audio: they would have trained a dual encoder that maps both audio and text into a shared embedding space.
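For concreteness, here’s roughly what I mean by “passing CLIP embeddings to an LLM” — a minimal LLaVA-style sketch, not OpenAI’s actual architecture, where a small learned projection maps frozen CLIP patch embeddings into the LLM’s token-embedding space (class names and dimensions below are made up for illustration):

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not anyone's real config)
CLIP_DIM = 768    # assumed CLIP embedding size
LLM_DIM = 4096    # assumed LLM hidden size

class VisionToLLMProjector(nn.Module):
    """Projects frozen CLIP embeddings into the LLM's token-embedding space."""
    def __init__(self, clip_dim: int = CLIP_DIM, llm_dim: int = LLM_DIM):
        super().__init__()
        # Often just a linear layer or small MLP is the only part that's trained
        self.proj = nn.Linear(clip_dim, llm_dim)

    def forward(self, clip_embeddings: torch.Tensor) -> torch.Tensor:
        # clip_embeddings: (batch, num_patches, clip_dim) from a frozen CLIP encoder
        return self.proj(clip_embeddings)  # (batch, num_patches, llm_dim)

# The projected "image tokens" get concatenated with ordinary text token
# embeddings and fed to the LLM as one sequence.
projector = VisionToLLMProjector()
clip_out = torch.randn(1, 257, CLIP_DIM)   # stand-in for CLIP patch embeddings
text_emb = torch.randn(1, 32, LLM_DIM)     # stand-in for embedded text tokens
llm_input = torch.cat([projector(clip_out), text_emb], dim=1)
print(llm_input.shape)  # torch.Size([1, 289, 4096])
```

The audio case would look the same on the input side, just with an audio encoder in place of CLIP.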
What’s not at all clear to me is whether they’re doing something special for output. Are you saying OpenAI has moved beyond next-token prediction and just hasn’t bothered to mention it?