I don't think anybody following OpenAI's feature releases will be caught off guard by ChatGPT becoming multi-modal. The app already features voice input. That still translates voice into text before sending, but it works so well that you basically never need to check or correct anything. If anything, you might have already been asking yourself why it doesn't reply back with a voice of its own.
And the ability to ingest images was a highlight, and much of the hype, of the GPT-4 announcement back in March: https://openai.com/research/gpt-4