I think it's app-only though.
Multimodal would be watching a YouTube video without captions and asking “how did a certain character know it was raining outside?”, with the answer based on the sound of rain when there is no image of rain.
From https://bard.google.com/updates:
> Expanding Bard’s understanding of YouTube videos
> What: We're taking the first steps in Bard's ability to understand YouTube videos. For example, if you’re looking for videos on how to make olive oil cake, you can now also ask how many eggs the recipe in the first video requires.
> Why: We’ve heard you want deeper engagement with YouTube videos. So we’re expanding the YouTube Extension to understand some video content so you can have a richer conversation with Bard about it.
Though now that I am reading the Gemini technical report, it can only receive audio as input; it can’t produce audio as output.
Still, based on a quick glance at their technical report, it seems Gemini might have superior audio input capabilities. Now that I think about it, though, I am not sure of this.