I did see that, though my interpretation is that breathing is included in its voice tokenizer, which helps it understand emotions in speech (the AI can generate breath sounds, after all). Other sounds, like bird songs or engine noises, may not work, but I could be wrong.
I suspect that, like images and video, their audio system is or will become more general-purpose. For example, it can generate the sound of coins falling onto a table.