That services like Alexa, Siri, Cortana, Google Assistant etc. ship with 24x7 listening devices shows that cost is no longer a hurdle to deploying this at scale. And note that they don't even have to upload raw audio files (though they could - I remember Google showcasing Lyra, an audio codec designed for very low bandwidth - https://github.com/google/lyra - so neither bandwidth nor storage is a big obstacle today). Today's phones also have enough power to transcribe audio on the device itself (e.g. Google's Live Transcribe feature now works offline - https://9to5google.com/2022/03/10/google-live-transcribe-fea... ). Why do you think there is a sudden push to put AI accelerators on SoCs, and thus on device? It's partly because Big Tech wants to offload more and more processing onto your device. We are at a stage where hardware has outpaced system software development and is actually underutilised.
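As a quick sanity check on the "storage is not an issue" claim, here's a back-of-envelope sketch assuming a Lyra-class codec running at roughly 3 kbps (the bitrate Google cited for the original Lyra release):

```python
# Rough storage/bandwidth cost of continuously uploading compressed audio.
# The ~3 kbps figure is an assumption based on the original Lyra codec.
BITRATE_BPS = 3_000            # ~3 kbps, Lyra-class speech codec
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_day = BITRATE_BPS / 8 * SECONDS_PER_DAY
print(f"{bytes_per_day / 1e6:.1f} MB per device per day")  # ~32.4 MB
```

Tens of megabytes per device per day is trivial for both a phone's data plan and a datacenter, which is the point: continuous capture is limited by policy and trust, not by cost.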
(And we are also at a techno-cultural cusp where the ownership of most of our devices is questionable, moving towards a dystopian future in which we will no longer be able to claim any rights over these computing devices.)