To be fair, you can do all that from Bing Chat too (image/voice recognition and generation). And plugins are coming to it as well.
The downsides with Bing currently are:
1. If you're not prepared to be civil to a language model, you're not going to have a good time.
2. The image input feature isn't quite the same. It feels like descriptions are bolted on from a separate model (GPT-4V, unless the Bing CTO was lying), so it's lossy in a way that using GPT-4V directly isn't.
3. Voice recognition and TTS are good but worse than what OpenAI is currently using. Perhaps they'll switch, since the TTS is new? But idk. It's also not hands-off the way OpenAI has designed its implementation.