Is Word Error Rate a Good Metric for Speech Recognition Models? (opens in new tab)

(assemblyai.com)

29 pointsdylanbfox4y ago30 comments

30 comments

I ship speech recognition to users for full computer control (mixed commands and dictation) with a very tight feedback loop. I get a lot of direct feedback about any common issues.

One time I beta tested a new speech model I trained that scored very well on WER. Something like 1/2 to 1/3 as many errors as the previous model.

This new model frustrated so many users, because the _nature_ of errors was much worse than before, despite fewer overall errors. The worst characteristic of this new model was word deletions. They occurred far more often. This makes me think we should consider reporting insertion/replacement/deletion as separate % metrics (which I found some older whitepapers did!)

We have CER (Character Error Rate), which is more granular and helps give a sense of whether entire words are wrong (CER = WER) or mostly just single letters (CER much lower than WER).

I'd welcome some ideas for new metrics, even if they only make sense for evaluating my own models against each other.

GPT2 perplexity?

Phoneme aware WER that penalizes errors more if they don't sound "alike" to the ground truth? (Because humans can in some cases read a transcription where every word is wrong, 100% WER, and still figure out by the sound of each incorrect word what the "right" words would have been)

"edge" error rate, that is, the likelihood that errors occur at the beginning / end of an utterance rather than the middle?

Some kind of word histogram, to demonstrate which specific words tend to result in errors / which words tend to be recognized well? One of the tasks I've found hardest is predicting single words in isolation. I'd love a good/standard (demographically distributed) dataset around this, e.g. 100,000 English words spoken in isolation by speakers with good accent/dialect distribution. I built a small version of this myself and I've seen WER >50% on it for many publicly available models.

More focus on accent/dialect aware evaluation datasets?

+ From one of my other comments here: some ways to detect error clustering? I think ideally you want errors to be randomly distributed rather than clustered on adjacent words or focused on specific parts of an utterance (e.g. tend to mess up the last word in the utterance)

praccu4y ago

At Amazon I set up an evaluation approach based on whether the system completed the desired task (in that context it was "did the search result using the speech recognition return the same set of items to buy as the transcript.)

https://scholar.google.com/citations?view_op=view_citation&h...

dylanbfoxOP4y ago

Interesting. It seems like in the "real world" WER is not really the metric that matters, it's more about "is this ASR system performing well to solve my use case" - which is better measured through task-specific metrics like the one you outlined your paper.

6gvONxR4sf7o4y ago

A pure ASR analog of this is how many/how much continuous utterances it enables. When I use tools like the one lunixbochs builds (including his own) the challenge as a user is trading of doing little bits at a time (slow, but easier to go back and correct) vs saying a whole ‘sentence’ in one go (fast and natural but you’re probably going to have to go back and edit/try again).

Sentence/command error rate (rate of 100% correct sentences/commands that don’t need any editing or re-attempting) is a decent proxy for this. It’s no silver bullet, but it more directly measures how frustrated your users will be.

If you really wanted to take care of the issues in the article, you could interview a bunch of users and find what percent of the, would go back and edit each kind of mistake (if 70% would have to go back and change ‘liked’ to ‘like’ then it’s 70% as bad as substituting ‘pound’ for ‘around’ which presumably every user will go back and edit).

The infuriating thing as a user is when metrics don’t map to the extra work I have to do.

2 more replies

causi4y ago

Your story reminds me of what's happened to Google's voice recognition over the last five years or so. It used to mis-hear words, but now it actively alters grammar and inserts words that sound nothing like what I actually said. Just try getting it to type the word "o'clock".

lunixbochs4y ago

At face value, actively altering grammar to words that you didn't say sounds like the language model is very heavily weighed. I'm curious if you mean the keyboard? Because that recently switched to on device I think, which means much smaller models and compute used.

causi4y ago

The behavior I'm complaining about happens on both, though as best I can tell the voice typing decides whether to use on-device or cloud-based depending on the conditions when you use it. If you cut your data off you'll get word-by-word recognition, whereas most of the times you're connected the whole sentence will pop in at the same time indicating it used the cloud.

dylanbfoxOP4y ago

Yes! Perplexity is a great idea. Although you could technically have a low perplexity prediction that is not similar to the ground truth transcription.

CER is definitely more granular. There are papers that basically count Deletions, for example, as 0.5(D) when calculating WER - since they consider Deletions "less bad", but if these weights aren't standardized then WER scores will be super hard to compare.

Personally I think some metric including some type of perplexity is the way to go.

mrfox3214y ago

So I was looking at SotA loss functions from a few years ago that weighted the CTC loss by the WER of the decoded phrase.

Could we generalize the WER weighting to optimize for the domain?

Something like

weight = w1 * WER + w2 * phonetic similarity + ...

which also requires a hyperparameter search... But we are already dumping so many GPU hours here.

I assume this is already being investigated by Google, though?

lunixbochs4y ago

There are some a similar techniques where you use evaluation metrics to decide which data to train on each epoch.

I wonder if you could make that parameter trainable instead of using a hyperparameter search for it.

For phonetic similarity I've been playing with a dual objective system that could be promising.

thebiss4y ago

I work in this domain, dealing exclusively with recognition for assistants, which is different from dictation. We measure three things, top down:

- whole phrase intent recognition rates. Run the transcribed phrase through a classifier to identify what the phrase is asking for, and compare that to what was expected, calculating an F1 score. Keep track of phrases that score poorly: they need to be improved.

- "domain term" error rate. Identify a list of key words that important to the domain and that must be recognized well, such as location names, products to buy, drug names, terms of art. For every transcribed utterance, measure the F1 score for those terms, and track alternatives created in confusion matrix. This results in distilled list of words the system gets wrong and what is heard instead.

- overall word error rate, to provide a general view of model performance.

blululu4y ago

Feels like a lot of the counter examples listed involve contractions and conjugation errors. Saying 'like' and 'liked' are different words is a strong interpretation. Similarly, 'I am' and 'I'm' are really not distinct words so counting that toward an error rate is a bit too literal. The objections could be solved by a decent parser. That said, weighting insertions and deletions equally is clearly a problem. Certain words ought to have more weight in a model. Weighting words by something like 1/log(frequency) might be a good start since less common words tend to be more important for meaning.

lunixbochs4y ago

Interesting, maybe instead of my proposed perplexity metric, we measure the difference in both utterance and per-word perplexity between ground truth and output with a strong language model? Ideally it's low - the language model should consider each predicted word to be "about as likely in context" as the closest ground truth words.

In other words, measure LM perplexity on the ground truth words, then on the predicted words, and minimize the difference in perplexities. Ideally with a general model like GPT2 or BERT or something that you aren't using anywhere in your actual ASR.

This may even be more tolerant of errors in the ground truth transcription than raw WER

dylanbfoxOP4y ago

> since less common words tend to be more important for meaning.

Exactly. Errors with proper nouns are usually more problematic than errors with stop words, yet they're weighted equally in the WER calculation. Ie, deleting "Bob" and "but" both count as a deletion of the same degree according to WER, but we as humans know that deleting "Bob" is potentially a lot more problematic than deleting "but".

lunixbochs4y ago

You could weigh insertions by how much perplexity they add (sum), deletions by how much perplexity they remove (-sum), and replacements by how big the ppl difference is in the replaced word (abs(sum)). And report this as a 4-part score (combined mean, then separate i/d/r). Lower is better.

Theory being you don't want to add or remove confusing words, but common stop words are less of an issue.

I'm not sure how this interacts with a multi word replacement, where the new words together make sense but independently make no sense to the LM.

CornCobs4y ago

I'm working on a similar domain, music transcription. The challenge is to estimate note values (how many beats is a note supposed to be as represented in the score?) and I'm not sure what would be the a good way to measure transcription accuracy. The naive note error rate cannot capture whether my model successfully detects certain musical structure, syncopation, dotted rhythms etc

danicgross4y ago

Related, are there better representations for music than standard notation (or MIDI)?

I'm wondering what the higher convolution levels could look like, if this was a CNN analyzing an image. Something between a the complete Ableton/Logic export and a MIDI file. Being able to capture the "feel" of a song (or a section within a song) strikes me as an important milestone towards designing really good generative music.

lunixbochs4y ago

Maybe some kind of alignment metric, to measure how far off on timing notes tend to be?

I can also imagine a generalized "local error rate" which measures how far away errors tend to be from each other. If errors tend to be clustered, I would guess that's showing inability to follow some musical pattern. I think you'd want errors to appear randomly distributed rather than clustered. (This metric might make sense for speech too)

adrianh4y ago

Hey, you should drop me an email (info in my HN bio). This is a passion of mine and I'm always up for chatting about it.

CornCobs4y ago

Sure thing!

jacobr14y ago

You might want to consider comparing generated sound files, rather than abstract notion. If you have the ground truth notion, render that using the same mechanism as your transcription. Then you can use various spectral comparison techniques on the sound, including things like fourier analysis to compare structure.

tgv4y ago

If the target of ASR would be document retrieval, it would make sense to apply the same (easy) transformations before calculating WER. Think: function word removal, unsplitting remaining contractions, and stemming. That would take out some of the problems, while staying true to the target. Aren't there any old-school linguists working on this?

gok4y ago

Rare example where Betteridge's law of headlines is wrong.

One clever metric that Google mentioned in their early ASR papers was interesting: "WebScore". Basically, they consider a hypothesis transcription to have errors only if it produces a different top web search result. [1] WebScore and WER always seemed to track each other though.

[1] https://static.googleusercontent.com/media/research.google.c...

j / k navigate · click thread line to collapse

30 comments

lunixbochs4y ago

I ship speech recognition to users for full computer control (mixed commands and dictation) with a very tight feedback loop. I get a lot of direct feedback about any common issues.

One time I beta tested a new speech model I trained that scored very well on WER. Something like 1/2 to 1/3 as many errors as the previous model.

We have CER (Character Error Rate), which is more granular and helps give a sense of whether entire words are wrong (CER = WER) or mostly just single letters (CER much lower than WER).

I'd welcome some ideas for new metrics, even if they only make sense for evaluating my own models against each other.

GPT2 perplexity?

"edge" error rate, that is, the likelihood that errors occur at the beginning / end of an utterance rather than the middle?

More focus on accent/dialect aware evaluation datasets?

praccu4y ago

https://scholar.google.com/citations?view_op=view_citation&h...

dylanbfoxOP4y ago

6gvONxR4sf7o4y ago

The infuriating thing as a user is when metrics don’t map to the extra work I have to do.

2 more replies

causi4y ago

lunixbochs4y ago

causi4y ago

dylanbfoxOP4y ago

Yes! Perplexity is a great idea. Although you could technically have a low perplexity prediction that is not similar to the ground truth transcription.

Personally I think some metric including some type of perplexity is the way to go.

mrfox3214y ago

So I was looking at SotA loss functions from a few years ago that weighted the CTC loss by the WER of the decoded phrase.

Could we generalize the WER weighting to optimize for the domain?

Something like

weight = w1 * WER + w2 * phonetic similarity + ...

which also requires a hyperparameter search... But we are already dumping so many GPU hours here.

I assume this is already being investigated by Google, though?

lunixbochs4y ago

There are some a similar techniques where you use evaluation metrics to decide which data to train on each epoch.

I wonder if you could make that parameter trainable instead of using a hyperparameter search for it.

For phonetic similarity I've been playing with a dual objective system that could be promising.

thebiss4y ago

I work in this domain, dealing exclusively with recognition for assistants, which is different from dictation. We measure three things, top down:

- overall word error rate, to provide a general view of model performance.

blululu4y ago

lunixbochs4y ago

This may even be more tolerant of errors in the ground truth transcription than raw WER

dylanbfoxOP4y ago

> since less common words tend to be more important for meaning.

lunixbochs4y ago

Theory being you don't want to add or remove confusing words, but common stop words are less of an issue.

I'm not sure how this interacts with a multi word replacement, where the new words together make sense but independently make no sense to the LM.

CornCobs4y ago

danicgross4y ago

Related, are there better representations for music than standard notation (or MIDI)?

lunixbochs4y ago

Maybe some kind of alignment metric, to measure how far off on timing notes tend to be?

adrianh4y ago

Hey, you should drop me an email (info in my HN bio). This is a passion of mine and I'm always up for chatting about it.

CornCobs4y ago

Sure thing!

jacobr14y ago

tgv4y ago

gok4y ago

Rare example where Betteridge's law of headlines is wrong.

[1] https://static.googleusercontent.com/media/research.google.c...

j / k navigate · click thread line to collapse