(edit for clarification: errors are not always something like "[UNINTELLIGIBLE]", where the system knows it doesn't know; they can also be misrecognitions that the system believes in with high confidence.)
Look, I have decades of experience dealing with human speech, and not just as an editor - I can trace the human voice from neural impulses in Broca's area through the physiology of vocal production, mechanical transduction into electrical signals, discrete Fourier transforms of the resultant waveforms into spectral information and back again, the reproduction of altered signals from time-aligned speakers to create a sense of spatialization, how those are processed in the human ear, and how the cilia are connected by nerves back to your brain. I'm a good enough editor that I can recognize many short words by the sight of their waveform, or make 10 edits in a row by sight and know they will sound good on playback.
So when I say that machine transcription is as good as human realtime transcription now, I say so with the clear expectation that those decades of craft are very close to being rendered obsolete. I absolutely expect to hand off the mechanical part of editing to a machine within 2 years or so. It's already at the stage where I edit some interviews as text, like in a word processor, and then export the edited document as audio and it's Good Enough - not for every speaker, but more than half the time.
NPR and a lot of commercial broadcasters already cut their material this way, because 30 minutes of reading and text editing gets you the same result as 3 hours of pure audio editing without a transcript.
Alignment of video to text is a big problem for me too.
You can even add or modify words that weren't originally there: https://www.descript.com/overdub
Would you go so far as to assert that machine transcription can be used as an objective benchmark of a speaker’s verbal intelligibility?
It is fraught with political and interpersonal dynamics today to approach someone, even privately one on one, and gently suggest that their career would get a huge boost if they hired a voice coach to improve their spoken delivery. So even when I don’t directly mention their accent, it becomes a very sensitive subject with many people.
However, if audio professionals like you can point to a system and say that the raw biomechanics and acoustic physics of the world dictate this is as good as parsing of human speech gets, physically and psychometrically, regardless of whether the system was biologically evolved or ML evolved, then the conversation can be couched far more objectively.
I enable recording and voice transcription in every meeting I can (ostensibly for DE&I, but really for my own selfish purposes), and I already observe in myself that I have to work hard to overcome a tendency to gloss over speakers who don’t transcribe well when I review meeting transcripts for key information I missed in my notes during the meeting.
Note that I’m perfectly aware that my foreign-language speaking skills are nowhere near the English skills of those I have tried to help. If the lingua franca of the coding world switched to Urdu tomorrow, I’d hire help to learn and polish my spoken Urdu, just as I went to a speech coach when learning public speaking - I can always use help with the many skills I lack.
Suppose 90% of the errors fall in the 10% of the content where the model is least confident. Then you can review just that 10% and catch 90% of the mistakes, taking a 2% error rate down to 0.2%.
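As a minimal sketch of that triage, assuming the engine exposes word-level confidence scores (many do in their JSON output) - the function and data shapes here are illustrative, not any particular vendor's API:

    # Flag the least-confident slice of a transcript for human review.
    # words: list of (word, confidence) tuples from the ASR engine.
    def flag_for_review(words, review_fraction=0.10):
        ranked = sorted(words, key=lambda w: w[1])            # least confident first
        cutoff = max(1, int(len(ranked) * review_fraction))   # bottom 10% by default
        return ranked[:cutoff]

    transcript = [("the", 0.99), ("quick", 0.97), ("brown", 0.42), ("fox", 0.95)]
    for word, conf in flag_for_review(transcript):
        print(f"review: {word!r} (confidence {conf:.2f})")    # -> review: 'brown' (confidence 0.42)

If 90% of the errors really do live in that bottom decile, hand-checking only those words is what takes the 2% down to 0.2%.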
The principle is that the engines (hopefully) have different failure modes, so each engine's 2-3% of errors falls in different areas of the audio. The key underlying assumption is that the error events are mutually exclusive - no two engines blow the same stretch of audio. Even if they were merely independent at ~2% each, any given pair would collide on only about 0.04% of words.
With 3 engines, you can use something like 2-of-3 stream matches to override the stream that mismatches.
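A minimal sketch of that vote, assuming the three transcripts have already been time-aligned word-for-word (the hard part in practice - real systems align on timestamps or edit distance first); the names are hypothetical:

    # Keep the word that at least 2 of 3 engines agree on at each aligned position.
    def majority_vote(stream_a, stream_b, stream_c):
        merged = []
        for a, b, c in zip(stream_a, stream_b, stream_c):
            if a == b or a == c:
                merged.append(a)                 # a agrees with at least one other stream
            elif b == c:
                merged.append(b)                 # a is the odd one out
            else:
                merged.append(f"[{a}|{b}|{c}]")  # 3-way disagreement: flag for a human
        return merged

    # Engine B misrecognizes "brown"; the other two streams outvote it.
    print(" ".join(majority_vote(
        "the quick brown fox".split(),
        "the quick crown fox".split(),
        "the quick brown fox".split(),
    )))  # -> the quick brown fox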
Not having to pause + rewind will save a ton of time for that 3%.