The 4 examples are stunningly good (the examples have speakers with heavy accents, speaking in foreign language, speaking with dynamic background noise, etc.), this is far and away better than anything else I've seen. Will be super curious to see other folks trying it out and seeing if it's as robust as it seems, including when confronted with audio speech with natural tics and uhhh's and uhmm's and everything in-between.
I think it's fair to say that AI transcription accuracy is now decidedly superior to the average human's; what the implications of this are, I'm not sure.
Existing (and affordable) offerings are so good that they can cope with shitty recordings off a phone speaker and maintain ~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement and other people who need to gather poor-quality audio at scale, though much less great for the targets of repressive authority.
Having this fully open is a big deal though - now that level of transcription ability can be wrapped as an audio plugin and just used wherever. Given the parallel advances in resynthesis and understanding idiomatic speech, in a year or two I probably won't need to cut out all those uuh like um y'know by hand ever again, and every recording can be given a noise-reduction bath and come out sounding like it was recorded in a room full of soft furniture.
I will try to put the code to the test and see how it goes.
Perhaps it will encourage people to add voice commands to their apps, the output of which can then be sent to GPT-3.
From what I can gather:
1. Includes model weights. I can't find the URL, but they reference them enough and have a CLI tool, so I presume I just haven't found them yet.
2. Includes code: https://github.com/openai/whisper
3. Released under MIT License: https://github.com/openai/whisper/blob/main/LICENSE
For a company that raised $1B, that's not exactly living up to their name and original mission.
I can understand not releasing GPT-3, even if I disagree with the decision.
"tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd5..."
"tiny": "https://openaipublic.azureedge.net/main/whisper/models/65147..."
"base.en": "https://openaipublic.azureedge.net/main/whisper/models/25a85..."
"base": "https://openaipublic.azureedge.net/main/whisper/models/ed3a0..."
"small.en": "https://openaipublic.azureedge.net/main/whisper/models/f953a..."
"small": "https://openaipublic.azureedge.net/main/whisper/models/9ecf7..."
"medium.en": "https://openaipublic.azureedge.net/main/whisper/models/d7440..."
"medium": "https://openaipublic.azureedge.net/main/whisper/models/345ae..."
"large": "https://openaipublic.azureedge.net/main/whisper/models/e4b87..."
[0] https://www.youtube.com/watch?v=DS6pE88Xg3s
[1]
$ yt-dlp --extract-audio --audio-format mp3 -o wire-fuck.mp3 https://www.youtube.com/watch?v=DS6pE88Xg3s
$ whisper --language en wire-fuck.mp3
[00:00.000 --> 00:02.000] Oh
[00:13.260 --> 00:15.260] Fuck
[00:15.260 --> 00:31.260] Motherfucker
[00:50.700 --> 00:52.700] Fuck
[00:52.700 --> 00:58.700] Oh
[00:58.700 --> 01:10.700] Fuck
[01:28.700 --> 01:55.900] Fuck
[02:02.340 --> 02:03.700] Motherfuck.
[02:10.220 --> 02:11.220] Oh, fuck.
[02:11.780 --> 02:12.780] Oh, fuck.
[02:25.900 --> 02:27.900] Fuck, fuck, fuck, fuck, fuck, fuck.
[02:27.900 --> 02:28.900] Motherfucker.
[02:32.900 --> 02:33.900] Oh, fuck.
[02:34.900 --> 02:35.900] Fuck.
[02:35.900 --> 02:36.900] Oh, fuck.
[02:36.900 --> 02:37.900] Oh, fuck.
[02:37.900 --> 02:38.900] Oh, fuck.
[02:48.900 --> 02:49.900] Motherfucker.
[02:53.900 --> 02:54.900] Fucking A.
[02:54.900 --> 02:56.900] Mm hmm.
[02:56.900 --> 03:12.900] Fuck.
[03:26.900 --> 03:28.900] Motherfucker.
[03:28.900 --> 03:32.900] Fuck me.
[03:58.900 --> 04:01.900] Oh.
[04:28.900 --> 04:34.900] Fuck.

Speaker 0 00:00:12 Oh, fuck motherfucker. Okay. Fuck, fuck, fuck, fuck, fuck, fuck, fuck, fuck.
My little fuck.
Speaker 1 00:02:10 Oh, fuck. Oh, fuck,
Speaker 0 00:02:25 Fuck, fuck, fuck, fuck, fuck, fuck, fuck, fuck my motherfucker.
Speaker 1 00:02:53 Fucking a.
Speaker 0 00:02:54 Mm-hmm. <affirmative> motherfucker. Fuck me. Um,

EDIT: Tried it and it worked great! It is very easy to use. I just did the pip install line in the readme and was ready to go. You literally just run the one pip install line, and then you run the program as "whisper my_audio.wav" and it goes. Really nice job, OpenAI!
                       Whisper   SoTA
LibriSpeech test-clean   2.7%    1.8%
LibriSpeech test-other   5.6%    2.9%
Switchboard             13.1%    4.9%
CallHome                15.8%    9.5%
The authors do explicitly state that they're trying to do a lot of fancy new stuff here, like being multilingual, rather than pursuing accuracy alone.

Comparing the readily available test sets from the paper to some of my personal robust models (for the Talon models, this is greedy decoding, no language model):
                   Talon 28M  Talon 300M  Talon 1B  Whisper Large  wav2vec 2.0 960h
librispeech clean   3.21       2.52        2.40      2.7            2.7
librispeech other   8.21       6.56        5.63      5.6            6.2
common voice       13.88      11.65        8.86      9.5           29.9
tedlium             7.51       6.55        5.47      4.0           10.5
I have a battery of more difficult tests on hand (including adversarial tests and diverse accent-specific metrics). I'll look at running these tests on each of the Whisper model sizes and following up with a larger comparison.

> Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper's zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.
What an interesting approach. What benefits does this have over having two dedicated models, one for speech-to-text, and another for translation?
It just seems so odd, given that the problems of speech-to-text and Spanish-to-English translation seem so different from one another (in terms of the problem domain). It seems so unusual to have both handled by one model!
Does knowledge of speech-to-text carry over into knowledge of translation? Does knowledge of translation carry over into knowledge of speech-to-text? So weird.
We also see in image generation models that multi-modal networks are more powerful than single purpose networks. As we move towards more advanced AI systems I suspect we will see more and more generalizable networks with distinct advantages over separate networks that get plugged together.
Now I wonder if it works equally well with Spanish from Spain (and its different regions) and Spanish from the New World (in its myriad different flavours).
If you want to give it a shot, you can find the python script in this repo: https://github.com/tobiashuttinger/openai-whisper-realtime
A bit more context on how it works: the system's default audio input is captured with Python, split into small chunks, and then fed to OpenAI's original transcription function. It tries (currently rather poorly) to detect word breaks and avoids splitting the audio buffer in those cases. Given how the model is designed, this isn't the most natural fit, but I thought it was worth trying. It works acceptably well.
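For what it's worth, the word-break heuristic can be as simple as cutting each buffer at its quietest recent point. A minimal sketch (the function name, window sizes, and energy measure are my own choices, not taken from the repo):

```python
import numpy as np

def best_split(audio, sr=16000, search_s=1.0, window_s=0.05):
    """Return a sample index near the end of `audio` at the lowest-energy
    window, so the chunk boundary is unlikely to land mid-word."""
    win = int(window_s * sr)
    tail = audio[-int(search_s * sr):]
    # mean squared amplitude per window over the last `search_s` seconds
    energies = [float(np.mean(tail[i:i + win] ** 2))
                for i in range(0, len(tail) - win + 1, win)]
    quietest = int(np.argmin(energies))
    return len(audio) - len(tail) + quietest * win
```

Splitting each chunk at `best_split(buffer)` and carrying the remainder into the next chunk would keep words intact most of the time, at the cost of slightly variable chunk lengths.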
Have you thought of using VAD (voice activity detection) to find breaks? Back in my day (a long time ago) the webrtc VAD stuff was considered decent:
https://github.com/wiseman/py-webrtcvad
Model isn’t optimized for this use but I like where you’re headed!
Took マッコウクジラ14頭が海岸に打ち上げられる オーストラリア(2022年9月21日) ("14 sperm whales washed up on the coast, Australia, September 21, 2022") https://www.youtube.com/watch?v=bZkNIzeRBk4
Extracted audio with youtube-dl -f bestaudio https://www.youtube.com/watch\?v\=bZkNIzeRBk4
Converted into:

[00:00.000 --> 00:13.000] オーストラリア南部の島で、真っ向くじら14棟が海岸に打ち上げられて死んでいるのが見つかり、専門家が調査のため原地入りしました。
[00:13.000 --> 00:25.000] 原地メディアによりますと、オーストラリア南部のキング棟で、19日、少なくとも14棟の真っ向くじらが海岸に打ち上げられて死んでいるのが見つかりました。
[00:25.000 --> 00:31.000] ほとんどが若いオーストを見られ、専門家が現場に重むき調査に当たっています。
[00:31.000 --> 00:41.000] くじらの死害は大きく運んだり埋めたりすることが難しいため、自然に分解されるのを待つ方針が検討されています。
[00:41.000 --> 00:52.000] また、死害を狙い、サメが海に集まる可能性があるとして、原地東局はサーファーなどに周囲に近づかないように呼びかけています。
[00:52.000 --> 01:02.000] 一方、21日にはタスマニア棟でおよそ230棟のくじらが浜辺に打ち上げられた状態で見つかりました。
[01:02.000 --> 01:07.000] およそ半数がまだ生きている模様で急助活動が進められています。
[01:07.000 --> 01:23.000] 見つかったのは、ゴンドーくじらの仲間と見られています。
Here are the exact steps to follow to get it running on Ubuntu 22.04 via WSL and yt-dlp:
1. pip install git+https://github.com/openai/whisper.git
2. yt-dlp -f 'ba' -x --audio-format mp3 https://www.youtube.com/watch/?v\=bZkNIzeRBk4
3. renamed the file to test.mp3
4. whisper test.mp3 --language Japanese --task translate --model large
Note: the large model will download a ~3 GB file.

I just wonder what system requirements Whisper has and whether there are open source voice recognition models that are specifically built for embedded devices.
Edit: According to this comment[0] the base model runs in real time on an M1 CPU. The tiny model apparently decodes an audio file twice as fast. These are promising results.
Mycroft has done a lot of cool and important work in the field to ship an actual personal assistant product (stuff like wake word detection).
[00:00.000 --> 00:06.500] Since the last one started, the number of times I've eaten has decreased.
[00:06.500 --> 00:11.000] If I get too carried away with the last one, I'll get hungry and do it.
[00:11.000 --> 00:14.500] I don't have time to eat.
[00:15.500 --> 00:18.000] I'm going to eat now.
[00:20.000 --> 00:23.000] It's going to take about 10 minutes from here.
[00:23.000 --> 00:31.000] It's been a while since I've had my last meal.
[00:31.000 --> 00:36.000] I feel like I'm losing my女子力.
[00:36.000 --> 00:39.000] I have to go back to my original self.
[00:39.000 --> 00:44.000] I have to get ready and go to bed.
[00:44.000 --> 00:46.000] It's not good.
[00:46.000 --> 00:51.000] I've been drinking a lot lately, so I'm going home.
[00:51.000 --> 00:53.000] I have to get my nails done this fall.
[00:53.000 --> 00:54.000] Halloween nails.
[00:54.000 --> 00:57.000] Halloween, Halloween, Halloween.
[00:57.000 --> 00:59.000] I'm going to the beauty salon today.
[00:59.000 --> 01:02.000] I'm going to get my nails done the day after tomorrow.
[01:02.000 --> 01:10.000] I used to look at a lot of clothes, but I stopped looking at them.
[01:10.000 --> 01:12.000] I'm going crazy.
[01:12.000 --> 01:22.000] My stomach's stopped in the middle of summer.

Though I assume the amount of Norwegian it has been exposed to is fairly limited, so in that light I'm actually impressed as well.
I tried it on a news segment from the radio[1], this is the large model output:
[00:14.000 --> 00:17.200] En skamløs krenking av FN pakten.
[00:17.200 --> 00:24.000] USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
[00:25.500 --> 00:29.400] Arbeidsklær som er ment til å være til begge kjønn, har det med å være tilpasset.
[00:29.400 --> 00:33.400] Men hvordan ville det gått, om det var motsatt?
[00:34.100 --> 00:38.900] Dyrevernsorganisasjon vil ha digital merking av regnstyr,
[00:38.900 --> 00:44.900] men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
[00:45.600 --> 00:51.400] Mange strømselskaper er positive til å tilby kundene fastpris på strøm, og det årevis.
[00:51.400 --> 00:59.900] Da risikerer de å måtte betale mye i nettopp åretsvis, sier aktører som aldri tilbyr fastpris.
[00:59.900 --> 01:21.900] Dette er onsdagens Dagsnytten. Jeg heter Espen Ås.
For reference, here's what he actually said, from the source[1] itself:

* En skamløs krenking av FN-pakten. USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
* Arbeidsklær som er ment å være til begge kjønn, er som regel tilpasset ... menn. Hvordan hadde det gått om det var motsatt?
* Dyrevernsoganisasjon vil ha digital merking av reinsdyr, men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
* Mange strømselskaper er positive til å tilby kundene fastpris på strøm - og det i årevis.
- Da risikerer de å måtte betale mye i nettopp; årevis, sier aktør som aldri tilbyr fastpris
Dette er onsdagens Dagsnytt 18 - jeg heter Espen Aas.
The translation didn't fare that well though:

[00:14.000 --> 00:17.000] A shameless violation of the UN treaty.
[00:17.000 --> 00:24.000] The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
[00:24.000 --> 00:33.000] Work clothes that are meant to be for both genders have to be suitable, but how would it be if it was the other way around?
[00:34.000 --> 00:44.000] The animal welfare organization will have a digital marking of reindeer, but the industry itself insists on the old traditional way of tearing a knife.
[00:45.000 --> 00:51.000] Many electricity companies are positive in offering customers fixed electricity prices, and that is annual.
[00:51.000 --> 00:58.000] Then they risk having to pay a lot in just a year, says an actor who has never offered fixed prices.
[00:58.000 --> 01:20.000] This is Wednesday's Dagsnytt 18. My name is Espen Ås.
For reference, here's Google Translate's attempt, which is pretty good:

* A shameless violation of the UN Charter. The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
* Work clothes intended for both sexes are usually adapted to ... men. How would it have gone if it had been the other way around?
* Animal welfare organizations want digital marking of reindeer, but the industry itself insists on the old, traditional way of marking with a knife.
* Many electricity companies are positive about offering customers a fixed price for electricity - and for years.
- Then they risk having to pay a lot in precisely; for years, says a player who never offers a fixed price
This is Wednesday's Dagsnytt 18 - my name is Espen Aas.
[1]: https://radio.nrk.no/podkast/dagsnytt_atten/l_5ce3e323-97a3-... (not sure if it's available outside of Norway)

> The source code must be the preferred form in which a programmer would modify the program. [...] Intermediate forms such as the output of a preprocessor or translator are not allowed.
If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.
Yes that means that there are almost no open source models and yes it's awesome that they released this and made the weights available. Just don't call it open source.
https://salsa.debian.org/deeplearning-team/ml-policy
BTW, wouldn't you take the existing model and do additional Hokkaido Japanese speaker training on top of it, rather than retraining the model from scratch?
> If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.
Precisely. These 'users' lifting the model can't do it themselves. You will still be contacting OpenAI for support or to add support for another language, and they will be the ones able to modify the model.
> Just don't call it open source.
That is true; it is still closed source, and already we are seeing the hype squad apologising to OpenAI for having 'open sourced' a closed model that you can't modify yourself.
OpenAI is still business as usual and nothing has changed.
You can do a lot with weights and no training data - for example you can pull the end layer off it and use it as a feature extractor.
And to modify it for Japanese speakers you'd fine-tune the existing model on additional data. If you wanted to modify the model itself, you can (sometimes, depending on what you want to do) modify an existing architecture by removing layers, adding replacements, and fine-tuning.
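As a toy illustration of the feature-extractor point (all names and weights below are made up; with Whisper you'd take the encoder's output instead), keeping everything up to the last layer of a released model gives you embeddings reusable for new tasks:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((80, 256))  # pretend these are released, trained weights
W2 = rng.standard_normal((256, 10))  # task-specific head we simply discard

def features(x):
    # everything up to (but not including) the final layer: a ReLU projection
    return np.maximum(x @ W1, 0.0)

embeddings = features(rng.standard_normal((4, 80)))  # (4, 256) features
```

A new classifier trained on `embeddings` inherits whatever the original network learned, with no access to the training data needed.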
I don't quite know what the right analogy for trained weights is. In many ways they are more valuable than the training data, because the compute needed to generate them is significant. In other ways it is nice to be able to inspect the data.
> The source code must be the preferred form in which a programmer would modify the program.
As a machine learning programmer I'd much prefer the weights to the raw data. It's not realistic for me to use that training data in any way with any compute I have access to.
>>A decoder is trained to predict the corresponding text...
Prediction of expected text in the context of the previous text.
While this is valuable in casual transcription, it can be extremely dangerous in serious contexts.
From personal experience, having given a deposition with an "AI" transcription, it will literally reverse the meanings of sentences.
This is because it produces the EXPECTED output in a context, and NOT THE ACTUAL OUTPUT.
Like a speaker that clips the output, these types of systems 'clip' the really valuable information out of a transcription. Worse yet, this is a completely silent failure, as the transcript LOOKS really good.
Basic info theory shows that there is more information contained in 'surprising' chunks of data than in expected ones. These systems actively work to substitute 'expected' speech to overwrite 'surprising' speech.
The transcript I got was utter trash, multiple pages of errata I had to submit when the normal is a couple of lines. And as I said, some literally reversed the meaning in a consequential way, and yet completely silently.
This kind of silent active failure mode is terrifying. Unless it is solved, and I see no way to solve it without removing ALL predictive algos from the system, these types of systems must not be used in any situation of serious consequence, at least not without real redundancy and backup.
Curious what anyone is using these days for a real-time transcription. It doesn't have to be perfect, but just good enough.
My kids watch some youtube videos where people make a mod that converts their speech to text, then looks for keywords and spawns a boss in Terraria if you say the wrong keyword, etc.
I made a clone of that with the .NET System.Speech.Recognition library. It... works... but my biggest problem is that #1 it waits until you are done speaking to convert to text in the callback, so there was too much of a delay for it to be fun (the point is that it will be checking a stream of chatter), and #2 the recognition is pretty crap; I mean it's nearly good enough for my silly purpose, but it's still pretty bad.
It's only doing a few seconds of transcription per minute for me, at least.
https://developer.apple.com/documentation/speech/recognizing...
Also, see `requiresOnDeviceRecognition`
That's so, so far beyond the previous state-of-the-art, it's absurd.
As for speed, to a computer we don't talk very fast, not even that guy.
I wonder if it could handle Rap God by Eminem... Let's find out!
Also, you are comparing Whisper's highlight reel with everyday performance of other models. Nobody shows their weaknesses in their highlight reel.
A 3m07s flac took 5m to transcribe:
$ whisper --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac'
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: korean
[00:00.000 --> 00:10.000] Blackpink
[00:11.000 --> 00:14.000] Kick in the door, wave in the coco
[00:14.000 --> 00:16.000] 팝콘이는 친게 껴들 생각 말고
[00:16.000 --> 00:19.000] I talk to talk, run ways I walk walk
[00:19.000 --> 00:21.000] 힘 감고 팝 팝 안 봐도 척
[00:21.000 --> 00:24.000] By one and two by two
[00:24.000 --> 00:26.000] 내 손끝 두 하나에 타면 아지은 중
[00:26.000 --> 00:30.000] 갓 자쇼 지금 화려해 T makes no sense
[00:30.000 --> 00:32.000] You couldn't get a dollar out of me
[00:33.000 --> 00:38.000] 자 오늘 밤이야 눈톱을 품고
[00:38.000 --> 00:41.000] 미혼을 뺏음 down
[00:41.000 --> 00:43.000] Look what you made us do
[00:43.000 --> 00:47.000] 천천히 널 잠재울 파이어
[00:48.000 --> 00:52.000] 잠이 날 만큼 아름다워
[00:52.000 --> 00:53.000] I bring the pain like
[00:53.000 --> 00:57.000] 디스탑, 팽팽, 디스탑, 팽팽, 디스탑, 팽팽, 팽팽
[00:57.000 --> 00:58.000] Get em, get em, get em
[00:58.000 --> 01:00.000] Straight till you don't like
[01:00.000 --> 01:01.000] Whoa, whoa, whoa
[01:01.000 --> 01:03.000] Straight till you don't like
[01:03.000 --> 01:04.000] Ah, ah, ah
[01:04.000 --> 01:05.000] Taste that, pink venom
[01:05.000 --> 01:06.000] Taste that, pink venom
[01:06.000 --> 01:08.000] Taste that, pink venom
[01:08.000 --> 01:09.000] Get em, get em, get em
[01:09.000 --> 01:11.000] Straight till you don't like
[01:11.000 --> 01:12.000] Whoa, whoa, whoa
[01:12.000 --> 01:13.000] Straight till you don't like
[01:13.000 --> 01:14.000] Ah, ah, ah
[01:14.000 --> 01:15.000] Blackpink and Amo
[01:15.000 --> 01:17.000] Got it by the smack ram
[01:17.000 --> 01:18.000] But rest in peace
[01:18.000 --> 01:19.000] Please light up a candle
[01:19.000 --> 01:20.000] This the knife of a vando
[01:20.000 --> 01:22.000] Messed up and I'm still in saline
…SNIP…

I just ran some benchmarks - M1 Max, pytorch, with a 1.29 second flac (looks like the matrix math was running on a single thread):
tiny
146.522ms detect_lang
549.131ms decode_one
0.057ms tokenizer
base
354.885ms detect_lang
1046.679ms decode_one
0.011ms tokenizer
small
803.892ms detect_lang
3194.503ms decode_one
0.017ms tokenizer
medium
2279.689ms detect_lang
10128.255ms decode_one
0.023ms tokenizer
large
3656.478ms detect_lang
17249.024ms decode_one
0.016ms tokenizer

To be able to give it text and hear the speech. A TTS (text to speech).
As a language learner, the ability to create my own sentences (based on existing ones I have, changing a word here or there) would be amazing.
How long till we have this, I wonder. I know I could use a service to do this currently, but I'd prefer having something running locally.
Hopefully someone in the OpenAI team reads this. :)
So I think TTS is a logical part of the system. I also think that there are peculiarities of voice interaction that aren’t captured in text training datasets, so they would need to do some fine tuning on actual voice conversation to make it feel natural.
All in due time I suppose.
That's intriguing. You can just set the model to transcribe everything into English, no matter which language the speaker is using, and it just works. Given that many people are much better at understanding English than at speaking it, this might make voice interfaces much more accessible without much work.
On one hand, it may capture something "deeper" about language.
On the other hand, it's likely to do great in general, but miss particularities of some language.
Understanding the coverage of the training model seems a perennial problem. Is there any (shorthand) way to compare language model training corpora?
Clearly if they use common subsets we have a literal comparison. I'm more interested in whether there's progress in characterizing corpora by speech styles, fluency, vocabulary sets, (noise) environment, emotionality, proposition types, etc.
(btw: 25 minutes for a 9-minute segment on a 12-thread x86. Lots of jargon spelled as it sounds. Sentences capitalized but no punctuation. Overall good.)
Some observations:
- The full translation of the 6:22 minute video takes about 22 seconds (17x real time)
- It recognizes the language by default (and did a good job to recognize it was french audio)
- MIT License [3]!
- The quality of the transcription is good, but not perfect.
- The quality of the translation (if you don't consider transcription errors as a translation error) is generally very good.
---
The transcription:
> Bonjour à tous, <error>j'suis</error> espère que vous allez bien, c'est ENTI. Et aujourd', <error>aujourd',</error> on se retrouve <error>un peu physique</error> pour parler de la termo dynamique. Vous ne vous inquiétez pas, ça va bien se passer. On va y aller ensemble, <error>être à par exemple,</error> je vous accompagne à travers une série de vidéos pour vous expliquer les principes de base en termo dynamique. Et bah, c'est parti, on va y aller tranquillement. L'idée, c'est vous puissiez comprendre la termo dynamique dans son ensemble. Donc, je vais vraiment prendre mon temps pour <error>couplisser</error> bien comprendre les notions,
The translation:
> Hello everyone, I hope you're doing well, it's NT and today we find ourselves a little physical to talk about the thermo dynamic. Don't worry, it's going well, we're going to go together and be the same. I'm going to accompany you through a series of videos to explain the basic principles in thermo dynamic. Well, let's go, <error>we're going to go quietly</error>. The idea is that you can understand the thermo dynamic <error>in sound together</error>. So I'm really going to take my time to understand the notions,
---
All in all very happy that OpenAI is publishing their models. If Stable Diffusion is any guide, people will hack some crazy things with this.
[1] https://github.com/openai/whisper [2] https://www.youtube.com/watch?v=OFLt-KL0K7Y [3] https://github.com/openai/whisper/blob/main/LICENSE
> in sound together
That's hilarious and honestly, incredibly bad. "Dans son ensemble" is a very common idiom (meaning "as a whole") while "in sound together" has to be pretty rare. "Son" means "his/hers/its" as well as "sound", and the former meaning is probably more common in general so I have no idea how this result could arise.
"Termo" also doesn't exist in French, it's "thermo", so the transcript even makes orthographic errors.
And I forgot about "couplisser", which is also a hilarious made-up word that sounds like it could mean something, but doesn't! Edit: Google finds exactly one reference to this, in a patent with a typo on the word "coulisser".
I'm still impressed by the transcript quality since it covers many languages, but the translation part is quite poor.
Both, wow. This is really interesting.
I have it running right now and it's not touching the GPU.
2 questions:
1) How does it compare to state-of-the-art FOSS solutions? I'm thinking about DeepSpeech or Vosk.
2) Would it be somehow possible to associate timestamps with the words recognized? That would be amazing for things such as audio editing or skipping to a particular location in a video.
But in general the model is robust and accurate, and trained on an amount of speech we never dreamed about in Vosk. We will certainly benefit from this model as a teacher (together with others, like the gigaspeech models). I recently wrote about it: https://alphacephei.com/nsh/2022/06/14/voting.html
For 2), it's actually mentioned in the description: "phrase-level timestamps". So it should be possible (phrase level is neat for skipping to a particular location in a video, but maybe not for audio editing).
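Per the repo's README, `transcribe()` returns a dict whose `segments` list carries `start`/`end` times in seconds, so turning the output into SRT-style cues is a few lines. A sketch (the result dict below is fabricated to show the shape, not real model output):

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 2.5 -> 00:00:02,500."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# shape of whisper's transcribe() result; values fabricated for illustration
result = {"segments": [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 5.0, "text": " General Kenobi."},
]}

for i, seg in enumerate(result["segments"], 1):
    print(f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
          f"\n{seg['text'].strip()}\n")
```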
Skimming the codebase I can't immediately see code to do additional training.
Being able to fine-tune the model to a specific language or use case (e.g. teach it specifically about some technical topic that might not be so prevalent in the current training set) would be majorly disruptive to the current SOTA in "callcenter analytics" tech. Especially when combining Whisper with GPT-3.
No surprise that it appears to have successfully transcribed all the recordings of Harvard Sentences I could find. https://en.wikipedia.org/wiki/Harvard_sentences
As in I don't want to input a file, I want to input the microphone sound.
I really wish I would have been paying attention in Unix class...
Something like `microphone | chunk 3s | whisper | stdout` would be SO COOL!!! I think that's possible, but I'm too lazy to look further.
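Something close is doable today. A sketch of the `chunk 3s` stage in Python, reading raw 16 kHz mono 16-bit PCM from stdin; the wiring comments at the bottom are assumptions about how you'd hook it up, not tested commands:

```python
import sys
import numpy as np

CHUNK_BYTES = 16000 * 2 * 3  # 3 s of 16 kHz mono 16-bit PCM

def chunks(stream):
    """Yield successive 3-second float32 audio buffers from a raw PCM stream."""
    while True:
        buf = stream.read(CHUNK_BYTES)
        if not buf:
            return
        # int16 PCM -> float32 in [-1, 1], the range Whisper expects
        yield np.frombuffer(buf, dtype=np.int16).astype(np.float32) / 32768.0

# Wiring it up would look roughly like (not run here):
#   arecord -f S16_LE -r 16000 -c 1 | python stream.py
# with stream.py doing:
#   import whisper
#   model = whisper.load_model("tiny")
#   for audio in chunks(sys.stdin.buffer):
#       print(model.transcribe(audio)["text"])
```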
Now I just want OCR that's even 50% as good as this...
"He's the bedroom cosmic rocker" (should be "He's the veteran cosmic rocker" in Veteran Cosmic Rocker by The Moody Blues)
I also noticed that it's a little on the conservative side for detecting speech; all songs were missing at least part of one line.
I am one of the top contributors to the tiny Mozilla Common Voice dataset for my language. The dataset is very small compared to those for other popular languages, and none of the other datasets mentioned contribute data in that language to Whisper's training.
And even with so little data to train on it still works surprisingly well.
(some NSFW words in the lyrics obv)
Just put it in a flake.nix, and "nix develop" followed by "virtualenv ./venv; . ./venv/bin/activate; pip install git+https://github.com/openai/whisper.git"
{
description = "Python 3.9 development environment";
outputs = { self, nixpkgs }:
let
system = "x86_64-linux";
pkgs = import nixpkgs { inherit system; };
in {
devShells.${system}.default = pkgs.mkShell {
buildInputs = [
pkgs.ffmpeg
pkgs.python39
pkgs.python39Packages.pip
pkgs.python39Packages.numpy
pkgs.python39Packages.pytorch
pkgs.python39Packages.virtualenv
];
};
};
}

[edit]
I confirmed CUDA worked with the "small" model, which used 3.3GB of GPU ram, and resulted in much poorer recognition than the "medium" model on my CPU (but it ran at least two orders of magnitude faster).
{
description = "Python 3.9 development environment";
outputs = { self, nixpkgs }:
let
system = "x86_64-linux";
pkgs = import nixpkgs {
inherit system;
config.allowUnfree = true;
config.cudaSupport = true;
};
in {
devShells.${system}.default = pkgs.mkShell {
buildInputs = with pkgs; [
cudatoolkit linuxPackages.nvidia_x11
cudaPackages.cudnn
libGLU libGL
xorg.libXi xorg.libXmu freeglut
xorg.libXext xorg.libX11 xorg.libXv xorg.libXrandr zlib
ncurses5 stdenv.cc binutils
ffmpeg
python39
python39Packages.pip
python39Packages.numpy
python39Packages.pytorch-bin
python39Packages.virtualenv
];
shellHook = ''
export LD_LIBRARY_PATH="${pkgs.linuxPackages.nvidia_x11}/lib"
'';
};
};
}

I want to build a tool that takes a video and generates subtitles for it, then I want to index the subtitles and let people search for a specific quote to scrub to that part of the video using automatically generated urls.
This is for a specific fandom of a ton of content, lots of dirty audio mostly recorded in a gym setting with multiple people speaking.
Have I been living under a rock, or is this new?
I assume it should help performance, because it means emphasis, timing and tone can be used to inform the translation. Helps make better guesses about information missing from the source language.
Sometimes it outputs the words "thank you" (which I did not say), sometimes it outputs a period. It never once output anything I said. It seems completely broken.
EDIT: apparently something about the combination of Safari+HF+Whisper was not working. I tried another Whisper demo on HF and had the same results. Switching to Chrome made it work flawlessly... I have no idea what kind of codec incompatibility was happening.
(irony)
https://news.ycombinator.com/item?id=32862172
MIT licensed model seems way better
Perhaps this development, along with continued optimization and increases in device compute power, will lead us into a near future where things like Mycroft devices and cellphones have local-only speech-to-text and translation capabilities that are accurate even with the environmental background noise encountered IRL.
Great work OpenAI team!
We've tested open source solutions for s2t, like Kaldi, but the quality was not good enough. However, one of the main advantages of a service like assembly.ai to me was that they offer sentence splitting in the form of punctuation, plus speaker detection, which Kaldi does not.
So I guess I answered my own question to some degree: an S2T service is more than just S2T. We already see assembly.ai add more and more features (like summarisation, PII redaction etc.) that are a value-add to plain S2T.
Still, curious to hear what your take on that is.
On a quick video transcription test, this model is more accurate than AssemblyAI and Rev AI. It will be harder for them to sell pure ASR now. Some more business-oriented applications will still be important though, for example ASR as part of a callcenter analytics solution or as part of a medical ERP system.
The value of automatic summarization is small; without AI it is very hard to get right, since you need to be an expert in the field to understand what is important.
Tested this out over the span of a few hours and got a solution up and running that downloads a video from YouTube, spits out the transcription, and uploads the resulting transcription file externally. We're still missing a piece to upload directly to YouTube, but it's a start!
As part of this experiment, we built some templates that let anyone play around with Whisper in our platform. If you're interested in seeing it, we made a video showing the process with our templates [2], and one doing it directly with Python [3].
Hope someone finds this useful!
[1] https://www.shipyardapp.com [2] https://www.youtube.com/watch?v=XGr4v3aY1e8 [3] https://www.youtube.com/watch?v=xfJpGgyUkvM
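For anyone who wants to reproduce the download-and-transcribe part of this pipeline, here is a minimal sketch that shells out to the yt-dlp and whisper CLIs. The URL, filenames, and model size are placeholders, and the external-upload step is omitted since it depends on the destination.

```python
import subprocess

def build_download_cmd(url, out="audio.%(ext)s"):
    # yt-dlp extracts the best audio track and converts it to mp3
    return ["yt-dlp", "-x", "--audio-format", "mp3", "-o", out, url]

def build_transcribe_cmd(audio, model="base"):
    # the whisper CLI writes .txt/.srt/.vtt transcripts next to the audio
    return ["whisper", audio, "--model", model]

def transcribe_video(url):
    subprocess.run(build_download_cmd(url), check=True)
    subprocess.run(build_transcribe_cmd("audio.mp3"), check=True)
```

The resulting transcript files could then be pushed wherever the last step of the pipeline needs them.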
I was originally using Adobe Premiere Pro's speech-to-text to do it, and wrote Python to convert its output to the Hyperaudio format on GitHub. With this, I can skip that whole step entirely, and it's fully open source, too.
App idea:
Build an app that takes a video and uses Hyperaudio or a similar project to add a clickable, searchable transcript (clicking in the transcript seeks the video).
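A minimal sketch of that idea, assuming Whisper-style segments (dicts with `start`, `end`, and `text` keys, which is the shape `model.transcribe()` returns): a small function renders the segments as an HTML page where clicking a transcript line seeks the video. Everything beyond the segment shape is illustrative.

```python
import html

def segments_to_html(segments, video_src="video.mp4"):
    """Render Whisper segments as a clickable transcript under a <video> tag."""
    lines = [f'<video id="v" src="{video_src}" controls></video>', "<div>"]
    for seg in segments:
        text = html.escape(seg["text"].strip())
        # Clicking a line seeks the video to the segment's start time.
        onclick = f"document.getElementById('v').currentTime={seg['start']}"
        lines.append(f'<p onclick="{onclick}">{text}</p>')
    lines.append("</div>")
    return "\n".join(lines)

# Example with one Whisper-style segment:
page = segments_to_html([{"start": 0.0, "end": 2.5, "text": " Hello world"}])
```

Searchability comes for free from the browser's find-in-page; a real app would also highlight the current segment by listening to the video's timeupdate event.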
A second run gave better results, but in most runs I do see instances where phrases repeat from 2-20 times.
I'm surprised by the quality on non-English languages, given that 80+% of the training data is English, and the rest is split between tens of languages.
It's sometimes close to perfect, and sometimes goes off the rails; I think the model tries to establish some sort of consistency within each sentence: if it starts wrong on the first few words, it can't build the rest properly.
But it's super fun.
https://clips.twitch.tv/ReliablePopularWerewolfOSkomodo-pcuw...
because... hard accent.
On the first run, Whisper thought it was Welsh, so I had to run it with --language en, and then it did pretty well.
https://i.imgur.com/TQiYU9X.png
It took 36 seconds in Google Colab.
However, it runs very slowly. It uses the CPU on my MacBook, presumably because it doesn't have an NVIDIA card.
Googling around, I found [plaidML](https://github.com/plaidml/plaidml), a project promising to run ML on many different GPU architectures. Does anyone know whether it's possible to plug the two together somehow? I'm not an ML researcher and don't understand the technical details of the domain, but I can understand and write Python code in domains I do understand, so I could do some glue work if required.
[1] https://www.youtube.com/watch?v=ywIyc8l1K1Q&ab_channel=1litt...
More troubling is a short audio clip that got a few full sentences back, several times the text length that comes back from the other models or Vosk. The content of the sentences is extremely far from the audio content. The best alignment I can find is that the first word of medium.en's interpretation is somewhat phonetically similar to the audio.
The small.en model doesn't show these behaviors, at least in this data set.
[00:00.000 --> 00:05.400] Gordy and County Kerry are investigating the theft of up to 60 sheep on Mount Brandon.
[00:05.400 --> 00:10.400] One of the farmers is offering a reward for information leading to the return of the use,
[00:10.400 --> 00:12.200] which are worth thousands of euro.
[00:12.200 --> 00:14.200] Well, I'm fine with that.
[00:14.200 --> 00:15.200] That's right.
[00:15.200 --> 00:16.200] Do you own them?
[00:16.200 --> 00:17.200] Anyone can say it.
[00:17.200 --> 00:18.200] Fine with that.
[00:18.200 --> 00:22.720] Last Saturday, Mikey Joe O'Shea brought his flock of Scotch sheep down from the mountain
[00:22.720 --> 00:25.320] commonage ahead of lambing.
[00:25.320 --> 00:29.840] He discovered over 50 were missing, allowing for a number of deaths and
[00:29.840 --> 00:30.840] strays.
[00:30.840 --> 00:34.600] Mikey is convinced over 45 sheep have been stolen.
[00:34.600 --> 00:35.600] It was a good night.
[00:35.600 --> 00:36.600] It would be a full moon there.
[00:36.600 --> 00:37.600] It would be a good night.
[00:37.600 --> 00:38.600] It would be bright out.
[00:38.600 --> 00:40.600] There could be anyone going up in the mountains.
[00:40.600 --> 00:41.600] It would be a good night.
[00:41.600 --> 00:43.600] Well, that was 45 sheep missing.
[00:43.600 --> 00:49.600] Mikey and the lambs and everything in the sheep, they counted out a nice bit of money.
[00:49.600 --> 00:52.200] They've been doing the boat in Nassan.
[00:52.200 --> 00:53.200] It's a big one. [00:53.200 --> 00:54.200] It's a big one. [00:54.200 --> 00:55.200] It's a big one.
[00:55.200 --> 00:59.000] Mikey's next door neighbor says some of his sheep have also been stolen.
[00:59.000 --> 01:00.000] Come back. [01:00.000 --> 01:01.000] Come back. [01:01.000 --> 01:02.000] Come back.
[01:02.000 --> 01:03.000] I've been missing about 10 years.
[01:03.000 --> 01:04.000] It's not all that difficult.
[01:04.000 --> 01:06.320] All they've got to do is have a good dog.
[01:06.320 --> 01:10.560] Have a good dog and go at night, some moonshine night.
[01:10.560 --> 01:11.560] Just put the dog around him.
[01:11.560 --> 01:14.120] Put him on a trailer and walk him.
[01:14.120 --> 01:18.360] And then probably somebody else to pick him up.
[01:18.360 --> 01:29.960] Everybody's doing it north, but he's doing it.
Second, is there a bug with how the script processes incoming audio segments? For a short 4 second clip, what I got was:
> [00:00.000 --> 00:03.760] Okay, Eunice, travel plans. I need to be in New York on Monday, L.A. on Tuesday, New York on Wednesday, L.A. on Thursday. You're knocking Friday. Got it?
> [00:03.760 --> 00:28.760] Got it.
However, the final segment should have been just shy of 1 second. It mistakenly thinks the last segment was 25 seconds long, and makes you wait for the processing.
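A small sanity check can flag segments like this one, whose end timestamp runs past the actual clip length; the 0.5 s tolerance and the example durations below are just illustrative.

```python
def suspicious_segments(segments, audio_duration):
    """Flag segments whose timestamps run past the actual audio length."""
    flagged = []
    for seg in segments:
        if seg["end"] > audio_duration + 0.5:  # small tolerance for rounding
            flagged.append(seg)
    return flagged

segs = [
    {"start": 0.0, "end": 3.76, "text": "Okay, Eunice, travel plans..."},
    {"start": 3.76, "end": 28.76, "text": "Got it."},
]
# For a roughly 4-second clip, the second segment is clearly misreported.
bad = suspicious_segments(segs, 4.0)
```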
for so many reasons.
But one thing that really pisses me off is not being able to turn it off on the iPhone, and the fact that, aside from "hidden cameras in my Airbnb", soon we will have to worry about secret listening machines EVERYWHERE.
Of course, the ability to scale this more cheaply (throwing more compute at it, instead of more people) is somewhat scary, but it's not really introducing a new capability. Especially since you still have to do something with the transcript. An AirBnB landlord who reads the transcript of what you said could as well have listened to the recording.
Anyway, it's out there now. No way to turn back.
I don’t believe OpenAI has anyone presenting at the conference, so presumably this was timed to coincide with that and get buzz at the conference.
Curious how this model compares with the FOSS STT from the startup Coqui.
It seems to describe the project better for a technical audience.
Anecdotally, I feel like there are plenty of times that I need context from more than 30 seconds ago to understand some technical jargon that's being discussed.
International Phonetic Alphabet (IPA)
- https://wikipedia.org/wiki/International_Phonetic_Alphabet
_________
EDIT: Based on the list of languages in the tokenizer code here, IPA doesn't appear to be supported:
https://github.com/openai/whisper/blob/5f8d4bcc254d4f3e833d3...
>>> result = whisper.decode(model, mel, options)
Traceback (most recent call last):
[snip]
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
It looks like a Torch error, is there some twiddling with "options" I can do to get it to run?
>>> options = whisper.DecodingOptions(fp16=False)
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
And it works.

Right now I decline all speech recognition because I don't want Orwellian listening devices in my house or pocket, and I haven't seen an answer. (Also, I haven't been bothered enough about speech-command interfaces to do a load of research; lazy me.)
Unfortunately my system is not ideal for today's AI tools: Whisper runs only on the CPU, and it's slow.
I know PyTorch recently added Metal support, but only for M-based Macs. Has anyone found a way to make it work with Intel Macs?
The description makes it sound like it is a model for transcribing English audio.
> We’ve trained and are open-sourcing a neural net called Whisper that approaches human level robustness and accuracy on English speech recognition.
I guess I will need to download and run it to see how correct that is.
> * recording
> * done recording
> Recording saved to file.wav
> Press enter to transcribe
> /Users/laptop/Development/Personal/Public/pythonProject1/venv/lib/python3.9/site-packages/whisper/transcribe.py:70: UserWarning: FP16 is not supported on CPU; using FP32 instead
>   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
> Detected language: english
> Goodbye, I need to go pick up my wife.
> Press enter to start recording
Any improvements welcome here.
```
import wave

import pyaudio
import whisper


def record_microphone(seconds):
    """Record from the default microphone and save to file.wav."""
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 44100
    WAVE_OUTPUT_FILENAME = "file.wav"

    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)

    print("* recording")
    frames = []
    for _ in range(int(RATE / CHUNK * seconds)):
        frames.append(stream.read(CHUNK))
    print("* done recording")

    stream.stop_stream()
    stream.close()
    p.terminate()

    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()
    return WAVE_OUTPUT_FILENAME


if __name__ == '__main__':
    model = whisper.load_model("base")  # load once, outside the loop
    seconds = 5
    while True:
        print("Press enter to start recording")
        input()
        filename = record_microphone(seconds)
        print("Recording saved to " + filename)
        print("Press enter to transcribe")
        input()
        result = model.transcribe(filename)
        print(result["text"])
```

For reference, GCP's Speech-to-Text didn't detect any speech from this clip -- even when using the enhanced phone model.
I wonder if this will change.
The "base" model (supposedly 16x faster than the large one) takes longer than the audio file's playback time on my machine to do a transcription.
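One way to put a number on "slower than playback" is the real-time factor, processing time divided by audio duration; anything above 1.0 means the model can't keep up with live audio. The timings below are made up for illustration.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF > 1.0 means transcription is slower than just playing the audio."""
    return processing_seconds / audio_seconds

# e.g. a 60 s clip that took 90 s to transcribe on CPU:
rtf = real_time_factor(90.0, 60.0)  # 1.5, i.e. slower than real time
```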
But also, this tool seems much better than Otter.ai, which gets every third word wrong when transcribing microbiology recordings.
I keep getting `ModuleNotFoundError: No module named 'setuptools.command.build'`
"[01:17.000 --> 01:32.000] Translated by Releska" when using the translate to english. That entire part of the song is instrumental. This line does not appear at all in the original transcribe only in the opus format rip.
It shows up in the YouTube rip in format 251 (Opus), but not in format 140 (AAC from YouTube), nor in the FLAC rip. All three give different results.
The translation quality seems tied to the bitrate: the same song converts to different words, the only difference being bitrate and format. Converting my own rip with the same parameters as YouTube (Opus @140 and then @130) didn't let me reproduce the error.
The model hung for a solid extra minute at the end when translating to English: the last ~90 seconds of the song took 60 seconds of real time, while the entire rest took about 90. The same behavior was not observed with transcribe.
Some of the English words are incorrect, but that was expected. The first Japanese "mistake" I found was "全ては二人の" instead of "すべては ふたりの" (the former being what Whisper wrote). A single random word, "hey", was transcribed/translated into English even though it's just the singer elongating the 園 while singing 楽園: "落ちてゆく 二人で繋がれた二人のラグ HEY" instead of "落ちていく 鎖でつながれた 二人の楽園".
I am using the official subtitles released on the youtube video.
It's a complex Japanese song with both Japanese and English. The original transcription took about 20 real-time seconds to produce the first line and 130 seconds for the whole song. It seems to show results in 20-second window increments, but this appears to depend on what it considers audio and what it throws away.
On my computer I wasn't able to use the large model because I ran out of VRAM (I have 8 GB; not sure how much more it'd require), so I ran it with medium.
The song is False Sympathy by Mondo Grosso. The MV is suggestive, in case that matters. I grabbed a fresh audio rip from YouTube because I didn't want to take the CD out of its case.
https://www.youtube.com/watch?v=B6Y-WsgpzlQ
It translates this version differently from the director's cut version; I ripped both as Opus.
There is something weird about how it handles the Opus-encoded version: I find the same "Translated by Releska" in a WAV version transcoded from the Opus.
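To check whether the codec/bitrate alone changes the output, one can re-encode the same source at a few bitrates with ffmpeg and compare the resulting transcripts. This is only a sketch: the filenames and bitrates are placeholders, and diffing whisper's outputs is left as a manual step.

```python
import subprocess

def transcode_cmd(src, dst, bitrate):
    # Re-encode the same audio at a given bitrate (e.g. "130k", "140k").
    return ["ffmpeg", "-y", "-i", src, "-b:a", bitrate, dst]

def make_test_files(src, bitrates=("130k", "140k")):
    """Produce one re-encoded copy per bitrate for an A/B transcription test."""
    outputs = []
    for br in bitrates:
        dst = f"test_{br}.opus"
        subprocess.run(transcode_cmd(src, dst, br), check=True)
        outputs.append(dst)
    # Then run e.g. `whisper test_130k.opus --task translate` on each and diff.
    return outputs
```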
Result of my own recording:
Detected language: georgian
ᔨᴉᴉ�ちゃんᓁᔇ � remnants ᡔ� founding ហ�ockey� slee សᕁ �eling ភᕩ�icularly អᕖᕤ�APPLAUSEPS ថ�Dav頻道 ប�DING� Możai បፘ្ទក ុក ឵� orchestral ុក ឵� arter ូ� Brettំ �
hilarious ល ឬ ᔼ� vårក បក ្៙ � Poll statements ឭ᪨្pson. ჩჩრუესიმეისლემვეერრშუეაირელმირისასასსსესსერერსივეესრრილმეხრე რეიმიმეფემსესე�
Results of clear Georgian audio [1].

On the tiny model:
Detected language: georgian
[00:00.000 --> 00:21.560] én
[00:21.560 --> 00:23.240] 我伦伦…
[00:23.280 --> 00:43.720] 我伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦因为b forestry
On the medium model:

Detected language: georgian
სრჱირესრრრრრრრრრრრრრნსსსრრრრრეე რრირრრრრრრრრე რსრნგნრრრრსრრრრრრრორრრრრრრრრრრ� ḵḸḇḤḾḤḾḤḾḤḾḤḾḤḾḤḾḤḾḾḤḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾ� ḥḾḼḥḾ
ḥḾḾ ḥḾḾ ḤḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾ� ḲḵḽḻḽḾ Ḫḵḽḻḽ so� ḻḽḽ ḻḽḻḻḽ ḱᴇ᷻ᵒ ḳᶟᄤḱ ḯᵁ Ḳᴄᴍᴆ Ḧᴍ� Ḧᵒ ḳᴍᴇ ḽᴄᴍᴛᴄ Ḧᴇᴆ ḳᵗᴇ ḽḮᴆ Ḫᴇᴾ ḿᴏᴇᴄᴄᴏ
ច�izar� wait �ห� examined ᑇទមះៈេំ supervision ង� იეეეეეეეეეეეეეეეეე მაეე ეაეეეეეეეეეეეეეეეეეეეე დაეეეეეეეეეეეეე უეეეეეეეეეეეეე ეა� მიი სმეიი მმიეი Ⴢქ სიიეი
სავიე სიიითთიიმემი, რაეე სიიმე სიიი ღიიიიწეირი საეიეიი სიიეი სი� ვეეფვეიიიე ქლეეშეეროეეეეეეეეეეეეე. ეგეზ ეყაკშეიეეეეეეეეეეეეეეეეეეეეეეეეეეეეეა, ნრროპიროო მმუმინ
სეეკნფეე სეეჍიგოშ სჟებიმელელეეკირპიე სემეიმე სეეიმმმ სეენემეეი სე� ᑦ� Famose m인데요 hqe bywall jaini threshold ji jani den poder vlogging bywall Take the text Ba
tou yodamj je te shake ba te shake baou contour but whatever Baou cube baou cup Baou rope Baou people Qeful Qeful იმიიიმიბთმითიიითიიიიიიიი
რაოეოოოენპეეეიეიიიიიიიიიომიიიიიიიიი რიიიიიიიიიიიმიი� ნსეეეეეეეეეეეეეეე სარეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეე� მጇივ ეეეიდჼვვ ნაბდადებ
ლმირეეეეფედუივევეეეიიეეეეე რარეიეეეევეეეეევეე სარრეეეეეეეეეეეეეეეეეეეეეეეეეეე ხშიიიიიიიიიიიიი ლიიიიიიი ლიიიიიიიიიი ლიიი ლიიიიიიი ლაიიიიი ეიიიიიიიიიიიიიიი იიიი მ�
I've also tested it on a few other audio inputs, and it failed to produce meaningful results on all of them, with all models. There was one case, with another audio clip [2] and the tiny model, where it got at least some words close to their phonetic values, but printed them in Cyrillic instead of Georgian and tried to interpret some Georgian words as Russian:
whisper audio.wav --language Georgian --task transcribe --model tiny
[00:00.000 --> 00:02.000] «Зураб Герча Джапарзис Ганц Хатеваром
[00:02.000 --> 00:04.000] умерен цупасу Хизгеблоту кащепаста
[00:04.000 --> 00:06.000] а опозационермии член шонахлари
[00:06.000 --> 00:07.000] с дрородисат Сакартолом
[00:07.000 --> 00:09.000] с акутаритеритория бюнда дай бронос
[00:09.000 --> 00:10.000] та тасовый торуси сам кадр
[00:10.000 --> 00:12.000] Сакартоломший ровно украйенисту
[00:12.000 --> 00:13.000] щойго екнебо
[00:13.000 --> 00:14.000] амсясахеб кирчи метитаусу
[00:14.000 --> 00:15.000] хлебислидерма
[00:15.000 --> 00:17.000] уцноктангадацема щейсяа уградунца
...
[1] https://www.youtube.com/watch?v=rE_zx_6RhL0
[2] https://www.youtube.com/watch?v=elrXgO8hjtI

[00:00.000 --> 00:10.000] पचास ताल में हमने प्रगती किये, इससे को इंटार नहीं कर सकता।
[00:10.000 --> 00:20.000] छुनाओ के दौरान वोट मांगते हुए, सरकार की नीतियों पर कठोर से कठोर प्रहार करते हुए,
[00:20.000 --> 00:28.000] और पुरानी सरकार की नीतियों नहीं आलोचना करने के लिए लैक बहुत सामग्री थी।
[00:28.000 --> 00:35.000] हर जगे मैंने ये कहा कि मैं उन लोगों में से नहीं हूँ, जो पचास वर्च की उपलड्यों पर पानी फिर दे।
[00:35.000 --> 00:43.000] ऐसा करना देश के पुर्षार्थ पर पानी फिरना होगा। ऐसा करना देश के किसान के साथ अन्याय करना होगा।
[00:43.000 --> 01:01.000] मल्दूर के साथ जात्ती करनी होगा। आम आद्मी के साथ भी वो अच्छा व्योहार नहीं होगा। जो स्वाल आज मन में उच्छा है और उच्छना चाही है। आदावी को पचास साथ होने आये, हम जैनती मनाने जा रहे हैं।
[01:01.000 --> 01:18.000] आज देश की स्तिती क्या है। हम पिछर के होगे हैं। प्रगती की दोड़ में, जो देश हमारे साथ आजाद हुए थे, वो हम से आगे बढ़ के। जो देश हमारे बाच जन में थे, वो हमें पीचे छोड़ थे।
[01:18.000 --> 01:34.000] दुनिया के गरी तम देशों में हमारी गड़न आये। वीस फीज़ी से जाना लो गरीबी की रेका के नीचे। राक्तपती महुदाय के विभाशन में गाऊं का उल्लेक हैं ना पीरे का पानी नहीं।
[01:34.000 --> 01:50.000] हम प्राथमी शिक्षा अनिवारे नहीं कर सकते हैं। लड्कियों की शिक्षा की उपेक्षा हो रही हैं। लड्कि का जन्म लेना तो इस देश में अभी तक एक अभिशाप है।
[01:50.000 --> 02:07.000] क्या सरकारी कदम उठाकर समाज में जाग्दृती पैदा करकें। क्या सब लोगों को जुटाकर ये तो ऐसा काम है जिस में कोई दलबंदी के लिए इस्थान नहीं। हम देश का नक्षा नहीं बदल सकते हैं। देश में साधनों की कमी नहीं है।
[02:07.000 --> 02:07.000] और साधनों की अगर कमी है तो उसको ठीक दन्त से प्राप्त किया जा सकता है। साधन बड़ाए भी जा सकते है। लेकिन जो साधन हैं उनका ठीक उपयोग नहीं हो रहा। जंता के उपर टेक्स लगाकर जो दन्नि कप्ता किया जाता है। उसका लाग जंता तक नहीं पहु
[02:37.000 --> 02:37.000] रख्कम जाती है। विदेशी बैंको में दन जाने का सिल्सिला अभी तक क्यों काएं है। उसको लोकने के लिए क्या कदम उठाएगे। हम विदेशी पूजी के लिए प्रैत्रशील हैं विदेशी पूजी आए और अगर विदेशी पूजी आती है अच्छे दन्त की टेक
[03:07.000 --> 03:07.000] अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
[03:37.000 --> 03:39.000] अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
[04:07.000 --> 04:09.000] अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
[04:37.000 --> 04:39.000] अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
The translation does a much better job, however:

[00:00.000 --> 00:10.000] In the last 50 years, we have made progress, no one can deny this.
[00:10.000 --> 00:20.000] During the elections, while asking for votes, while attacking the government's policies harshly,
[00:20.000 --> 00:28.000] and to criticize the policies of the old government, a lot of material was needed.
[00:28.000 --> 00:35.000] Everywhere, I have said that I am not one of those people who pour water on the fruits of 50 years.
[00:35.000 --> 00:39.000] To do this, we will have to pour water on the efforts of the country.
[00:39.000 --> 00:43.000] To do this, we will have to do injustice with the farmers of the country.
[00:43.000 --> 00:45.000] We will have to do caste with the laborers.
[00:45.000 --> 00:50.000] Even with the common man, that will not be a good behavior.
[00:50.000 --> 00:55.000] The question that arises in the mind today and should arise,
[00:55.000 --> 01:01.000] Freedom has come to be 50 years, we are going to celebrate.
[01:01.000 --> 01:04.000] What is the situation of the country today?
[01:04.000 --> 01:07.000] Why did we get separated?
[01:07.000 --> 01:14.000] In the race of progress, the country that got freedom along with us, they went ahead of us.
[01:14.000 --> 01:19.000] The country that was after us, they left us behind.
[01:19.000 --> 01:25.000] In the poorest countries of the world, they counted us.
[01:25.000 --> 01:29.000] 20% of the population is below the poverty line.
[01:29.000 --> 01:35.000] In the speech of the President, there is no mention of villages or drinking water.
[01:35.000 --> 01:39.000] We cannot enforce primary education.
[01:39.000 --> 01:43.000] The education of girls is being neglected.
[01:43.000 --> 01:50.000] The birth of a girl is still a curse in this country.
[01:50.000 --> 01:55.000] Is it by taking government steps, by creating awareness in the society?
[01:55.000 --> 02:01.000] Is it by uniting all the people that there is no place for party?
[02:01.000 --> 02:05.000] Can't we change the map of the country?
[02:05.000 --> 02:08.000] There is no shortage of resources in the country.
[02:08.000 --> 02:14.000] And if there is a shortage of resources, it can be obtained in the right way, resources can be increased.
[02:14.000 --> 02:21.000] But the resources that are there, they are not being used properly.
[02:21.000 --> 02:30.000] The wealth that is collected by taxing the public, its profit does not reach the public, it does not reach the common man.
[02:30.000 --> 02:32.000] Where does it go?
[02:32.000 --> 02:35.000] Whose pockets are filled?
[02:35.000 --> 02:39.000] Whose treasury does that money go to?
[02:39.000 --> 02:44.000] Why is the chain of money going to foreign banks still established?
[02:44.000 --> 02:47.000] What steps have been taken to stop it?
[02:47.000 --> 02:52.000] We are motivated for foreign worship, foreign worship has come.
[02:52.000 --> 03:01.000] And if foreign worship comes for good technology, for infrastructure,
[03:01.000 --> 03:06.000] for education, then no one will object.
[03:06.000 --> 03:11.000] I believe that our communist friends will not object either.
[03:11.000 --> 03:19.000] But is the maximum use of the resources in the country happening?
[03:19.000 --> 03:26.000] Is it not true that corruption has become a national disease?
[03:26.000 --> 03:31.000] I remember that Swargi Rajiv Gandhi had said in a speech that I send one rupee from Delhi,
[03:31.000 --> 03:36.000] but where I send the rupee, as I reach there, 19 paise are left.
[03:36.000 --> 03:41.000] I asked him how this miracle happens.
[03:41.000 --> 03:47.000] Bhaskar said that when the rupee runs, it shrinks.
[03:47.000 --> 03:54.000] The rupee shrinks, it gets into the hand, it goes into the pocket, it becomes small.
[03:54.000 --> 03:58.000] It is difficult to recognize the rupee.
[03:58.000 --> 04:02.000] The rupee can be hidden.
[04:02.000 --> 04:06.000] The situation of the currency of the country is not good.
[04:06.000 --> 04:10.000] First, the government expenditure has increased, it is increasing.
[04:10.000 --> 04:17.000] It needs common consent to reduce without reducing.
[04:17.000 --> 04:24.000] No one can work in the same way.
[04:24.000 --> 04:27.000] Yes, our old Prime Minister Narasimha Raoji,
[04:27.000 --> 04:34.000] if he would have tried in this direction after stabilizing himself, then he would have succeeded.
[04:34.000 --> 04:47.000] But he was stuck in some such things that he could not pay attention to these problems.