(edit for clarification: errors are not always something like "[UNINTELLIGIBLE]", where the system knows it doesn't know; they can also be misrecognitions that the system believes in with high confidence.)
Look, I have decades of experience dealing with human speech, and not just as an editor - I can trace the human voice from neural impulses in Broca's area through the physiology of vocal production, mechanical transduction into electrical signals, discrete Fourier transforms of the resultant waveforms into spectral information and back again, the reproduction of altered signals from time-aligned speakers to create a sense of spatialization, how those are processed in the human ear, and how the cilia are connected by nerves back to your brain. I'm a good enough editor that I can recognize many short words by the sight of their waveform, or make 10 edits in a row by sight and know they will sound good on playback.
So when I say that machine transcription is as good as human realtime transcription now, I say so with the clear expectation that those decades of craft are very close to being rendered obsolete. I absolutely expect to hand off the mechanical part of editing to a machine within 2 years or so. It's already at the stage where I edit some interviews as text, like in a word processor, and then export the edited document as audio and it's Good Enough - not for every speaker, but more than half the time.
NPR and a lot of commercial broadcasters already cut their material this way, because 30 minutes of reading and text editing gets you the same result as 3 hours of pure audio editing without a transcript.
Alignment of video to text is a big problem for me too.
You can even add or modify words that weren't originally there: https://www.descript.com/overdub
Would you go so far as to assert that machine transcription can be used as an objective benchmark of a speaker’s verbal intelligibility?
It is fraught with political and interpersonal dynamics today to approach someone, even privately one on one, and gently suggest that their career would get a huge boost if they hired a voice coach to improve their spoken delivery. So even when I don’t directly mention their accent, it becomes a very sensitive subject with many people.
However, if audio professionals like you can point to a system and say that the raw biomechanics and acoustic physics of the world dictate this is as good as parsing of human speech gets, physically and psychometrically, regardless of whether the system was biologically evolved or ML evolved, then the conversation can be couched far more objectively.
I enable recording and voice transcription in every meeting I can (ostensibly for DE&I, but really for my own selfish purposes), and I already observe in myself that I have to work hard to overcome a tendency to gloss over speakers who don’t transcribe well when I review meeting transcripts for key information I missed in my notes during the meeting.
Note that I’m perfectly aware that my foreign-language speaking skills are nowhere near the English skills of those I have tried to help. If the lingua franca of the coding world switched to Urdu tomorrow, I’d hire help to learn and polish my spoken Urdu, just as I went to a speech coach when learning public speaking - I can always use help with the many skills I lack.
Suppose 90% of the errors fall in the 10% of the content where the model is least confident. Then you can review just that 10% and catch 90% of the mistakes, taking a 2% error rate down to 0.2%.
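As a minimal sketch of that triage, assuming the engine exposes word-level confidence scores (many do in their JSON output) - the function and data shapes here are illustrative, not any particular vendor's API:

    # Flag the least-confident slice of a transcript for human review.
    # words: list of (word, confidence) tuples from the ASR engine.
    def flag_for_review(words, review_fraction=0.10):
        ranked = sorted(words, key=lambda w: w[1])            # least confident first
        cutoff = max(1, int(len(ranked) * review_fraction))   # bottom 10% by default
        return ranked[:cutoff]

    transcript = [("the", 0.99), ("quick", 0.97), ("brown", 0.42), ("fox", 0.95)]
    for word, conf in flag_for_review(transcript):
        print(f"review: {word!r} (confidence {conf:.2f})")    # -> review: 'brown' (confidence 0.42)

If 90% of the errors really do live in that bottom decile, hand-checking only those words is what takes the 2% down to 0.2%.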
The principle is that the engines (hopefully) have different failure modes, so each engine's 2-3% of errors falls in different areas of the audio. The key underlying assumption is that the error events are mutually exclusive - no two engines blow the same stretch of audio. Even if they were merely independent at ~2% each, any given pair would collide on only about 0.04% of words.
With 3 engines, you can use something like 2-of-3 stream matches to override the stream that mismatches.
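A minimal sketch of that vote, assuming the three transcripts have already been time-aligned word-for-word (the hard part in practice - real systems align on timestamps or edit distance first); the names are hypothetical:

    # Keep the word that at least 2 of 3 engines agree on at each aligned position.
    def majority_vote(stream_a, stream_b, stream_c):
        merged = []
        for a, b, c in zip(stream_a, stream_b, stream_c):
            if a == b or a == c:
                merged.append(a)                 # a agrees with at least one other stream
            elif b == c:
                merged.append(b)                 # a is the odd one out
            else:
                merged.append(f"[{a}|{b}|{c}]")  # 3-way disagreement: flag for a human
        return merged

    # Engine B misrecognizes "brown"; the other two streams outvote it.
    print(" ".join(majority_vote(
        "the quick brown fox".split(),
        "the quick crown fox".split(),
        "the quick brown fox".split(),
    )))  # -> the quick brown fox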
Not having to pause + rewind will save a ton of time for that 3%.