EDIT: After writing and posting the original version of this comment, I ran an experiment: I dictated it to Siri while recording the audio at the same time, then fed that recording to both Whisper's tiny.en and medium.en. Siri did terribly for me. Whisper tiny.en was 100% accurate, as far as I can tell, and the only thing medium.en did differently was add a few commas that tiny.en had missed. I also ended up playing the audio file for Siri, and that did not end well either. YMMV, but even the tiny model seems very useful. tiny.en took 17.5 seconds to process the ~1 minute audio file, and medium.en took 351 seconds, but I think there is a lot of room for performance optimization on this M2 MBA. The model evaluation ran purely on the CPU, not the GPU or Neural Engine, and it wasn't even using all of the CPU cores for whatever reason.
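To put those timings in perspective, here is a quick sketch of the real-time factor (processing time divided by audio length). It assumes the clip was exactly 60 seconds, which is only an approximation of the "~1 minute" above; the function name is just for illustration.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 means faster than real time."""
    return processing_seconds / audio_seconds


# Assumption: the clip is treated as exactly 60 s (the comment only says "~1 minute").
AUDIO_SECONDS = 60.0

# Timings reported above for the M2 MBA, CPU only.
timings = {"tiny.en": 17.5, "medium.en": 351.0}

for name, seconds in timings.items():
    rtf = real_time_factor(seconds, AUDIO_SECONDS)
    verdict = "faster" if rtf < 1.0 else "slower"
    print(f"{name}: {rtf:.2f}x real time ({verdict} than real time)")
```

Under that assumption, tiny.en comes out at roughly 0.3x real time and medium.en at roughly 5.9x, so only the tiny model keeps up with speech on this setup.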
----
With Siri dictation, I feel like I usually spend at least as much time correcting its mistakes as I do speaking the dictation itself. In some cases, that is still faster/easier than typing, but I would rather have a voice model that can work in about the same total amount of time without requiring constant corrections. If I speak for 30 seconds, then I can do other things for 30 seconds while my phone processes it… that might actually be preferable if it gets it right. Otherwise, I’ll be spending 30 seconds actively editing it anyway. Even a reduction in the number of edits required per dictation would be nice. Admittedly, I feel like Google and Microsoft already do a much better job here.
It could be interesting to use the tiny model to give a preview of the writing while the large model is taking its time, and then allow the user to tap on words that changed to see the predictions from the tiny model and correct back to them if they want. I was doing some experiments a few minutes ago, and on one audio clip, the tiny model wrote down a very literal interpretation of an uncommon sci-fi word, and that was more accurate than either the medium or the large models. The rest of the time, the larger models did better, as expected.
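As a rough sketch of the "tap on words that changed" idea: given the tiny model's preview text and the larger model's final text, a word-level diff can find the spans where they disagree. This uses Python's standard difflib; the function name and the example sentences (including the sci-fi word) are made up for illustration, not taken from my actual clips.

```python
from difflib import SequenceMatcher


def changed_words(preview: str, final: str) -> list[tuple[int, str, str]]:
    """Return (word index in final, preview words, final words) for each disagreement."""
    a = preview.split()
    b = final.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep both versions so the UI could offer the preview text as an alternative.
            changes.append((j1, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Hypothetical transcripts: the tiny model spells the odd word literally,
# while the larger model "corrects" it into common words.
tiny_text = "engage the zorblax drive now"
large_text = "engage the sore blacks drive now"

for index, from_tiny, from_large in changed_words(tiny_text, large_text):
    print(f"word {index}: large model wrote {from_large!r}, tiny model wrote {from_tiny!r}")
```

A UI could highlight each returned span in the final text and, on tap, offer the tiny model's wording as a one-touch replacement.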
But, I don’t know. This is interesting to me, but I agree there could be issues with making it workable for real-time transcription.