The 4 examples are stunningly good (the examples have speakers with heavy accents, speaking in foreign language, speaking with dynamic background noise, etc.), this is far and away better than anything else I've seen. Will be super curious to see other folks trying it out and seeing if it's as robust as it seems, including when confronted with audio speech with natural tics and uhhh's and uhmm's and everything in-between.
I think it's fair to say that AI transcription accuracy is now decidedly superior to the average human's; what the implications of this are, I'm not sure.
Existing (and affordable) offerings are so good that they can cope with shitty recordings off a phone speaker and maintain ~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement and other people who need to gather poor-quality audio at scale, though much less great for the targets of repressive authority.
Having this fully open is a big deal though - now that level of transcription ability can be wrapped as an audio plugin and just used wherever. Given the parallel advances in resynthesis and understanding idiomatic speech, in a year or two I probably won't need to cut out all those uuh like um y'know by hand ever again, and every recording can be given a noise-reduction bath and come out sounding like it was recorded in a room full of soft furniture.
I will try to put the code to the test and see how it goes.
Perhaps it will encourage people to add voice commands to their apps, the output of which can then be sent to GPT-3.
From what I can gather:
1. Includes model weights. I can't find the URL, but they reference them enough and have a CLI tool, so I presume I just haven't found them yet.
2. Includes code: https://github.com/openai/whisper
3. Released under MIT License: https://github.com/openai/whisper/blob/main/LICENSE
For a company that raised $1B, that's not exactly living up to their name and original mission.
I can understand not releasing GPT-3, even if I disagree with the decision.
"tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd5..."
"tiny": "https://openaipublic.azureedge.net/main/whisper/models/65147..."
"base.en": "https://openaipublic.azureedge.net/main/whisper/models/25a85..."
"base": "https://openaipublic.azureedge.net/main/whisper/models/ed3a0..."
"small.en": "https://openaipublic.azureedge.net/main/whisper/models/f953a..."
"small": "https://openaipublic.azureedge.net/main/whisper/models/9ecf7..."
"medium.en": "https://openaipublic.azureedge.net/main/whisper/models/d7440..."
"medium": "https://openaipublic.azureedge.net/main/whisper/models/345ae..."
"large": "https://openaipublic.azureedge.net/main/whisper/models/e4b87..."
[0] https://www.youtube.com/watch?v=DS6pE88Xg3s
[1]
$ yt-dlp --extract-audio --audio-format mp3 -o wire-fuck.mp3 https://www.youtube.com/watch?v=DS6pE88Xg3s
$ whisper --language en wire-fuck.mp3
[00:00.000 --> 00:02.000] Oh
[00:13.260 --> 00:15.260] Fuck
[00:15.260 --> 00:31.260] Motherfucker
[00:50.700 --> 00:52.700] Fuck
[00:52.700 --> 00:58.700] Oh
[00:58.700 --> 01:10.700] Fuck
[01:28.700 --> 01:55.900] Fuck
[02:02.340 --> 02:03.700] Motherfuck.
[02:10.220 --> 02:11.220] Oh, fuck.
[02:11.780 --> 02:12.780] Oh, fuck.
[02:25.900 --> 02:27.900] Fuck, fuck, fuck, fuck, fuck, fuck.
[02:27.900 --> 02:28.900] Motherfucker.
[02:32.900 --> 02:33.900] Oh, fuck.
[02:34.900 --> 02:35.900] Fuck.
[02:35.900 --> 02:36.900] Oh, fuck.
[02:36.900 --> 02:37.900] Oh, fuck.
[02:37.900 --> 02:38.900] Oh, fuck.
[02:48.900 --> 02:49.900] Motherfucker.
[02:53.900 --> 02:54.900] Fucking A.
[02:54.900 --> 02:56.900] Mm hmm.
[02:56.900 --> 03:12.900] Fuck.
[03:26.900 --> 03:28.900] Motherfucker.
[03:28.900 --> 03:32.900] Fuck me.
[03:58.900 --> 04:01.900] Oh.
[04:28.900 --> 04:34.900] Fuck.

Speaker 0 00:00:12 Oh, fuck motherfucker. Okay. Fuck, fuck, fuck, fuck, fuck, fuck, fuck, fuck.
My little fuck.
Speaker 1 00:02:10 Oh, fuck. Oh, fuck,
Speaker 0 00:02:25 Fuck, fuck, fuck, fuck, fuck, fuck, fuck, fuck my motherfucker.
Speaker 1 00:02:53 Fucking a.
Speaker 0 00:02:54 Mm-hmm. <affirmative> motherfucker. Fuck me. Um,

EDIT: Tried it and it worked great! It is very easy to use. I just did the pip install line in the readme and was ready to go. You literally just run the one pip install line, and then you run the program as "whisper my_audio.wav" and it goes. Really nice job, OpenAI!
                       Whisper   SoTA
LibriSpeech test-clean   2.7%    1.8%
LibriSpeech test-other   5.6%    2.9%
Switchboard             13.1%    4.9%
CallHome                15.8%    9.5%
The authors do explicitly state that they're trying to do a lot of fancy new stuff here, like being multilingual, rather than pursuing accuracy alone.

Comparing the readily available test sets from the paper to some of my personal robust models (for the Talon models, this is greedy decoding, no language model):
                   Talon 28M  Talon 300M  Talon 1B  Whisper Large  wav2vec 2.0 960h
librispeech clean   3.21       2.52        2.40      2.7            2.7
librispeech other   8.21       6.56        5.63      5.6            6.2
common voice       13.88      11.65        8.86      9.5           29.9
tedlium             7.51       6.55        5.47      4.0           10.5
I have a battery of more difficult tests on hand (including adversarial tests and diverse accent-specific metrics). I'll look at running these tests on each of the Whisper model sizes and following up with a larger comparison.

> Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper's zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.
What an interesting approach. What benefits does this have over having two dedicated models, one for speech-to-text, and another for translation?
It just seems so odd, given that the problems of speech-to-text and Spanish-to-English translation seem so different from one another (in terms of the problem domain). It seems so unusual to have both handled by one model!
Does knowledge of speech-to-text carry over into knowledge of translation? Does knowledge of translation carry over into knowledge of speech-to-text? So weird.
We also see in image generation models that multi-modal networks are more powerful than single purpose networks. As we move towards more advanced AI systems I suspect we will see more and more generalizable networks with distinct advantages over separate networks that get plugged together.
Now I wonder if it works equally well with Spanish from Spain (and its different regions) and Spanish from the New World (in its myriad different flavours).
If you want to give it a shot, you can find the python script in this repo: https://github.com/tobiashuttinger/openai-whisper-realtime
A bit more context on how it works: the system's default audio input is captured with Python, split into small chunks, and then fed to OpenAI's original transcription function. It tries (currently rather poorly) to detect word breaks and avoids splitting the audio buffer in those cases. Given how the model is designed, this isn't the most natural fit, but I thought it was worth trying. It works acceptably well.
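For what it's worth, the word-break heuristic can be as simple as cutting each buffer at its quietest recent point. A minimal sketch (the function name, window sizes, and energy measure are my own choices, not taken from the repo):

```python
import numpy as np

def best_split(audio, sr=16000, search_s=1.0, window_s=0.05):
    """Return a sample index near the end of `audio` at the lowest-energy
    window, so the chunk boundary is unlikely to land mid-word."""
    win = int(window_s * sr)
    tail = audio[-int(search_s * sr):]
    # mean squared amplitude per window over the last `search_s` seconds
    energies = [float(np.mean(tail[i:i + win] ** 2))
                for i in range(0, len(tail) - win + 1, win)]
    quietest = int(np.argmin(energies))
    return len(audio) - len(tail) + quietest * win
```

Splitting each chunk at `best_split(buffer)` and carrying the remainder into the next chunk would keep words intact most of the time, at the cost of slightly variable chunk lengths.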
Have you thought of using VAD (voice activity detection) to find breaks? Back in my day (a long time ago) the webrtc VAD stuff was considered decent:
https://github.com/wiseman/py-webrtcvad
Model isn’t optimized for this use but I like where you’re headed!
Took マッコウクジラ14頭が海岸に打ち上げられる オーストラリア(2022年9月21日) ("14 sperm whales washed up on the coast, Australia, September 21, 2022") https://www.youtube.com/watch?v=bZkNIzeRBk4
Extracted audio with youtube-dl -f bestaudio https://www.youtube.com/watch\?v\=bZkNIzeRBk4
Converted into:

[00:00.000 --> 00:13.000] オーストラリア南部の島で、真っ向くじら14棟が海岸に打ち上げられて死んでいるのが見つかり、専門家が調査のため原地入りしました。
[00:13.000 --> 00:25.000] 原地メディアによりますと、オーストラリア南部のキング棟で、19日、少なくとも14棟の真っ向くじらが海岸に打ち上げられて死んでいるのが見つかりました。
[00:25.000 --> 00:31.000] ほとんどが若いオーストを見られ、専門家が現場に重むき調査に当たっています。
[00:31.000 --> 00:41.000] くじらの死害は大きく運んだり埋めたりすることが難しいため、自然に分解されるのを待つ方針が検討されています。
[00:41.000 --> 00:52.000] また、死害を狙い、サメが海に集まる可能性があるとして、原地東局はサーファーなどに周囲に近づかないように呼びかけています。
[00:52.000 --> 01:02.000] 一方、21日にはタスマニア棟でおよそ230棟のくじらが浜辺に打ち上げられた状態で見つかりました。
[01:02.000 --> 01:07.000] およそ半数がまだ生きている模様で急助活動が進められています。
[01:07.000 --> 01:23.000] 見つかったのは、ゴンドーくじらの仲間と見られています。
Here are the exact steps to follow to get it running on Ubuntu 22.04 via WSL and yt-dlp:
1. pip install git+https://github.com/openai/whisper.git
2. yt-dlp -f 'ba' -x --audio-format mp3 https://www.youtube.com/watch/?v\=bZkNIzeRBk4
3. renamed the file to test.mp3
4. whisper test.mp3 --language Japanese --task translate --model large
Note: the large model will download a ~3 GB file.

I just wonder what system requirements Whisper has and whether there are open source voice recognition models that are specifically built for embedded devices.
Edit: According to this comment[0] the base model runs in real time on an M1 CPU. The tiny model apparently decodes an audio file twice as fast. These are promising results.
Mycroft has done a lot of cool and important work in the field to ship an actual personal assistant product (stuff like wake word detection).
[00:00.000 --> 00:06.500] Since the last one started, the number of times I've eaten has decreased.
[00:06.500 --> 00:11.000] If I get too carried away with the last one, I'll get hungry and do it.
[00:11.000 --> 00:14.500] I don't have time to eat.
[00:15.500 --> 00:18.000] I'm going to eat now.
[00:20.000 --> 00:23.000] It's going to take about 10 minutes from here.
[00:23.000 --> 00:31.000] It's been a while since I've had my last meal.
[00:31.000 --> 00:36.000] I feel like I'm losing my女子力.
[00:36.000 --> 00:39.000] I have to go back to my original self.
[00:39.000 --> 00:44.000] I have to get ready and go to bed.
[00:44.000 --> 00:46.000] It's not good.
[00:46.000 --> 00:51.000] I've been drinking a lot lately, so I'm going home.
[00:51.000 --> 00:53.000] I have to get my nails done this fall.
[00:53.000 --> 00:54.000] Halloween nails.
[00:54.000 --> 00:57.000] Halloween, Halloween, Halloween.
[00:57.000 --> 00:59.000] I'm going to the beauty salon today.
[00:59.000 --> 01:02.000] I'm going to get my nails done the day after tomorrow.
[01:02.000 --> 01:10.000] I used to look at a lot of clothes, but I stopped looking at them.
[01:10.000 --> 01:12.000] I'm going crazy.
[01:12.000 --> 01:22.000] My stomach's stopped in the middle of summer.

Though I assume the amount of Norwegian it has been exposed to is fairly limited, so in that light I'm actually impressed as well.
I tried it on a news segment from the radio[1], this is the large model output:
[00:14.000 --> 00:17.200] En skamløs krenking av FN pakten.
[00:17.200 --> 00:24.000] USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
[00:25.500 --> 00:29.400] Arbeidsklær som er ment til å være til begge kjønn, har det med å være tilpasset.
[00:29.400 --> 00:33.400] Men hvordan ville det gått, om det var motsatt?
[00:34.100 --> 00:38.900] Dyrevernsorganisasjon vil ha digital merking av regnstyr,
[00:38.900 --> 00:44.900] men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
[00:45.600 --> 00:51.400] Mange strømselskaper er positive til å tilby kundene fastpris på strøm, og det årevis.
[00:51.400 --> 00:59.900] Da risikerer de å måtte betale mye i nettopp åretsvis, sier aktører som aldri tilbyr fastpris.
[00:59.900 --> 01:21.900] Dette er onsdagens Dagsnytten. Jeg heter Espen Ås.
For reference, here's what he actually said, from the source[1] itself:

* En skamløs krenking av FN-pakten. USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
* Arbeidsklær som er ment å være til begge kjønn, er som regel tilpasset ... menn. Hvordan hadde det gått om det var motsatt?
* Dyrevernsoganisasjon vil ha digital merking av reinsdyr, men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
* Mange strømselskaper er positive til å tilby kundene fastpris på strøm - og det i årevis.
- Da risikerer de å måtte betale mye i nettopp; årevis, sier aktør som aldri tilbyr fastpris
Dette er onsdagens Dagsnytt 18 - jeg heter Espen Aas.
The translation didn't fare that well though:

[00:14.000 --> 00:17.000] A shameless violation of the UN treaty.
[00:17.000 --> 00:24.000] The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
[00:24.000 --> 00:33.000] Work clothes that are meant to be for both genders have to be suitable, but how would it be if it was the other way around?
[00:34.000 --> 00:44.000] The animal welfare organization will have a digital marking of reindeer, but the industry itself insists on the old traditional way of tearing a knife.
[00:45.000 --> 00:51.000] Many electricity companies are positive in offering customers fixed electricity prices, and that is annual.
[00:51.000 --> 00:58.000] Then they risk having to pay a lot in just a year, says an actor who has never offered fixed prices.
[00:58.000 --> 01:20.000] This is Wednesday's Dagsnytt 18. My name is Espen Ås.
For reference, here's Google Translate's attempt, which is pretty good:

* A shameless violation of the UN Charter. The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
* Work clothes intended for both sexes are usually adapted to ... men. How would it have gone if it had been the other way around?
* Animal welfare organizations want digital marking of reindeer, but the industry itself insists on the old, traditional way of marking with a knife.
* Many electricity companies are positive about offering customers a fixed price for electricity - and for years.
- Then they risk having to pay a lot in precisely; for years, says a player who never offers a fixed price
This is Wednesday's Dagsnytt 18 - my name is Espen Aas.
[1]: https://radio.nrk.no/podkast/dagsnytt_atten/l_5ce3e323-97a3-... (not sure if it's available outside of Norway)

> The source code must be the preferred form in which a programmer would modify the program. [...] Intermediate forms such as the output of a preprocessor or translator are not allowed.
If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.
Yes that means that there are almost no open source models and yes it's awesome that they released this and made the weights available. Just don't call it open source.
https://salsa.debian.org/deeplearning-team/ml-policy
BTW, wouldn't you take the existing model and do additional Hokkaido Japanese speaker training on top of it, rather than retraining the model from scratch?
> If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.
Precisely. These 'users' lifting the model can't do it themselves. You will still be contacting OpenAI for support or to add support for another language, and they will be the ones able to modify the model.
> Just don't call it open source.
That is true; it is still closed source, and already we are seeing the hype squad apologising to OpenAI for having 'open sourced' a closed model that you can't modify yourself.
OpenAI is still business as usual and nothing has changed.
You can do a lot with weights and no training data - for example you can pull the end layer off it and use it as a feature extractor.
And to modify it for Japanese speakers you'd fine-tune the existing model on additional data. If you wanted to modify the model itself, you can (sometimes, depending on what you want to do) modify an existing architecture by removing layers, adding replacements, and fine-tuning.
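As a toy illustration of the feature-extractor point (all names and weights below are made up; with Whisper you'd take the encoder's output instead), keeping everything up to the last layer of a released model gives you embeddings reusable for new tasks:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((80, 256))  # pretend these are released, trained weights
W2 = rng.standard_normal((256, 10))  # task-specific head we simply discard

def features(x):
    # everything up to (but not including) the final layer: a ReLU projection
    return np.maximum(x @ W1, 0.0)

embeddings = features(rng.standard_normal((4, 80)))  # (4, 256) features
```

A new classifier trained on `embeddings` inherits whatever the original network learned, with no access to the training data needed.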
I don't quite know what the right analogy for trained weights is. In many ways they are more valuable than the training data, because the compute needed to generate them is significant. In other ways it is nice to be able to inspect the data.
> The source code must be the preferred form in which a programmer would modify the program.
As a machine learning programmer I'd much prefer the weights to the raw data. It's not realistic for me to use that training data in any way with any compute I have access to.
>>A decoder is trained to predict the corresponding text...
Prediction of expected text in the context of the previous text.
While this is valuable in casual transcription, it can be extremely dangerous in serious contexts.
From personal experience, having given a deposition with an "AI" transcription, it will literally reverse the meanings of sentences.
This is because it produces the EXPECTED output in a context, and NOT THE ACTUAL OUTPUT.
Like a speaker that clips the output, these types of systems 'clip' the really valuable information out of a transcription. Worse yet, this is a completely silent failure, as the transcript LOOKS really good.
Basic info theory shows that there is more information contained in 'surprising' chunks of data than in expected ones. These systems actively work to substitute 'expected' speech to overwrite 'surprising' speech.
The transcript I got was utter trash, multiple pages of errata I had to submit when the normal is a couple of lines. And as I said, some literally reversed the meaning in a consequential way, and yet completely silently.
This kind of silent active failure mode is terrifying. Unless it is solved, and I see no way to solve it without removing ALL predictive algos from the system, these types of systems must not be used in any situation of serious consequence, at least not without real redundancy and backup.
Curious what anyone is using these days for a real-time transcription. It doesn't have to be perfect, but just good enough.
My kids watch some youtube videos where people make a mod that converts their speech to text, then looks for keywords and spawns a boss in Terraria if you say the wrong keyword, etc.
I made a clone of that with the .NET System.Speech.Recognition library. It... works... but my biggest problem is that #1 it waits until you are done speaking to convert to text in the callback, so there was too much of a delay for it to be fun (the point is that it will be checking a stream of chatter), and #2 the recognition is pretty crap; I mean it's nearly good enough for my silly purpose, but it's still pretty bad.
It's only doing a few seconds of transcription per minute for me, at least.
https://developer.apple.com/documentation/speech/recognizing...
Also, see `requiresOnDeviceRecognition`
That's so, so far beyond the previous state-of-the-art, it's absurd.
As for speed, to a computer we don't talk very fast, not even that guy.
I wonder if it could handle Rap God by Eminem... Let's find out!
Also, you are comparing Whisper's highlight reel with everyday performance of other models. Nobody shows their weaknesses in their highlight reel.
A 3m07s flac took 5m to transcribe:
$ whisper --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac'
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: korean
[00:00.000 --> 00:10.000] Blackpink
[00:11.000 --> 00:14.000] Kick in the door, wave in the coco
[00:14.000 --> 00:16.000] 팝콘이는 친게 껴들 생각 말고
[00:16.000 --> 00:19.000] I talk to talk, run ways I walk walk
[00:19.000 --> 00:21.000] 힘 감고 팝 팝 안 봐도 척
[00:21.000 --> 00:24.000] By one and two by two
[00:24.000 --> 00:26.000] 내 손끝 두 하나에 타면 아지은 중
[00:26.000 --> 00:30.000] 갓 자쇼 지금 화려해 T makes no sense
[00:30.000 --> 00:32.000] You couldn't get a dollar out of me
[00:33.000 --> 00:38.000] 자 오늘 밤이야 눈톱을 품고
[00:38.000 --> 00:41.000] 미혼을 뺏음 down
[00:41.000 --> 00:43.000] Look what you made us do
[00:43.000 --> 00:47.000] 천천히 널 잠재울 파이어
[00:48.000 --> 00:52.000] 잠이 날 만큼 아름다워
[00:52.000 --> 00:53.000] I bring the pain like
[00:53.000 --> 00:57.000] 디스탑, 팽팽, 디스탑, 팽팽, 디스탑, 팽팽, 팽팽
[00:57.000 --> 00:58.000] Get em, get em, get em
[00:58.000 --> 01:00.000] Straight till you don't like
[01:00.000 --> 01:01.000] Whoa, whoa, whoa
[01:01.000 --> 01:03.000] Straight till you don't like
[01:03.000 --> 01:04.000] Ah, ah, ah
[01:04.000 --> 01:05.000] Taste that, pink venom
[01:05.000 --> 01:06.000] Taste that, pink venom
[01:06.000 --> 01:08.000] Taste that, pink venom
[01:08.000 --> 01:09.000] Get em, get em, get em
[01:09.000 --> 01:11.000] Straight till you don't like
[01:11.000 --> 01:12.000] Whoa, whoa, whoa
[01:12.000 --> 01:13.000] Straight till you don't like
[01:13.000 --> 01:14.000] Ah, ah, ah
[01:14.000 --> 01:15.000] Blackpink and Amo
[01:15.000 --> 01:17.000] Got it by the smack ram
[01:17.000 --> 01:18.000] But rest in peace
[01:18.000 --> 01:19.000] Please light up a candle
[01:19.000 --> 01:20.000] This the knife of a vando
[01:20.000 --> 01:22.000] Messed up and I'm still in saline
…SNIP…

I just ran some benchmarks - M1 Max, pytorch, with a 1.29 second flac (looks like the matrix math was running on a single thread):
tiny
146.522ms detect_lang
549.131ms decode_one
0.057ms tokenizer
base
354.885ms detect_lang
1046.679ms decode_one
0.011ms tokenizer
small
803.892ms detect_lang
3194.503ms decode_one
0.017ms tokenizer
medium
2279.689ms detect_lang
10128.255ms decode_one
0.023ms tokenizer
large
3656.478ms detect_lang
17249.024ms decode_one
0.016ms tokenizer

To be able to give it text and hear the speech. A TTS (text to speech).
As a language learner, the ability to create my own sentences (based on existing ones I have, changing a word here or there) would be amazing.
How long till we have this, I wonder. I know I could use a service to do this currently, but I'd prefer having something running locally.
Hopefully someone in the OpenAI team reads this. :)
So I think TTS is a logical part of the system. I also think that there are peculiarities of voice interaction that aren’t captured in text training datasets, so they would need to do some fine tuning on actual voice conversation to make it feel natural.
All in due time I suppose.
That's intriguing. You can just set the model to transcribe everything into English, no matter which language the speaker is using, and it just works. Given that many people are much better at understanding English than at speaking it, this might make voice interfaces much more accessible without much work.
On one hand, it may capture something "deeper" about language.
On the other hand, it's likely to do great in general, but miss particularities of some language.
Understanding the coverage of the training model seems a perennial problem. Is there any (shorthand) way to compare language model training corpora?
Clearly if they use common subsets we have a literal comparison. I'm more interested in whether there's progress in characterizing corpora by speech styles, fluency, vocabulary sets, (noise) environment, emotionality, proposition types, etc.
(btw: 25 minutes for a 9-minute segment on a 12-thread x86. Lots of jargon spelled as it sounds. Sentences capitalized but no punctuation. Overall good.)
Some observations:
- The full translation of the 6:22 minute video takes about 22 seconds (17x real time)
- It recognizes the language by default (and did a good job to recognize it was french audio)
- MIT License [3]!
- The quality of the transcription is good, but not perfect.
- The quality of the translation (if you don't consider transcription errors as a translation error) is generally very good.
---
The transcription:
> Bonjour à tous, <error>j'suis</error> espère que vous allez bien, c'est ENTI. Et aujourd', <error>aujourd',</error> on se retrouve <error>un peu physique</error> pour parler de la termo dynamique. Vous ne vous inquiétez pas, ça va bien se passer. On va y aller ensemble, <error>être à par exemple,</error> je vous accompagne à travers une série de vidéos pour vous expliquer les principes de base en termo dynamique. Et bah, c'est parti, on va y aller tranquillement. L'idée, c'est vous puissiez comprendre la termo dynamique dans son ensemble. Donc, je vais vraiment prendre mon temps pour <error>couplisser</error> bien comprendre les notions,
The translation:
> Hello everyone, I hope you're doing well, it's NT and today we find ourselves a little physical to talk about the thermo dynamic. Don't worry, it's going well, we're going to go together and be the same. I'm going to accompany you through a series of videos to explain the basic principles in thermo dynamic. Well, let's go, <error>we're going to go quietly</error>. The idea is that you can understand the thermo dynamic <error>in sound together</error>. So I'm really going to take my time to understand the notions,
---
All in all very happy that OpenAI is publishing their models. If Stable Diffusion is any guide, people will hack some crazy things with this.
[1] https://github.com/openai/whisper [2] https://www.youtube.com/watch?v=OFLt-KL0K7Y [3] https://github.com/openai/whisper/blob/main/LICENSE
> in sound together
That's hilarious and honestly, incredibly bad. "Dans son ensemble" is a very common idiom (meaning "as a whole") while "in sound together" has to be pretty rare. "Son" means "his/hers/its" as well as "sound", and the former meaning is probably more common in general so I have no idea how this result could arise.
"Termo" also doesn't exist in French, it's "thermo", so the transcript even makes orthographic errors.
And I forgot about "couplisser", which is also a hilarious made-up word that sounds like it could mean something, but doesn't! Edit: Google finds exactly one reference to this, in a patent with a typo on the word "coulisser".
I'm still impressed by the transcript quality since it covers many languages, but the translation part is quite poor.
Both, wow. This is really interesting.
I have it running right now and it's not touching the GPU.
2 questions:
1) How does it compare to state-of-the-art FOSS solutions? I'm thinking about DeepSpeech or Vosk.
2) Would it be somehow possible to associate timestamps with the words recognized? That would be amazing for things such as audio editing or skipping to a particular location in a video.
But in general the model is robust and accurate, and trained on an amount of speech we never dreamed about in Vosk. We will certainly benefit from this model as a teacher (together with others, like the gigaspeech models). I recently wrote about it: https://alphacephei.com/nsh/2022/06/14/voting.html
For 2), it's actually mentioned in the description: "phrase-level timestamps". So it should be possible (phrase level is neat for skipping to a particular location in a video, but maybe not for audio editing).
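Per the repo's README, `transcribe()` returns a dict whose `segments` list carries `start`/`end` times in seconds, so turning the output into SRT-style cues is a few lines. A sketch (the result dict below is fabricated to show the shape, not real model output):

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 2.5 -> 00:00:02,500."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# shape of whisper's transcribe() result; values fabricated for illustration
result = {"segments": [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 5.0, "text": " General Kenobi."},
]}

for i, seg in enumerate(result["segments"], 1):
    print(f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
          f"\n{seg['text'].strip()}\n")
```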
Skimming the codebase I can't immediately see code to do additional training.
Being able to fine-tune the model to a specific language or use case (e.g. teach it specifically about some technical topic that might not be so prevalent in the current training set) would be majorly disruptive to the current SOTA in "callcenter analytics" tech. Especially when combining Whisper with GPT-3.
No surprise that it appears to have successfully transcribed all the recordings of Harvard Sentences I could find. https://en.wikipedia.org/wiki/Harvard_sentences
As in I don't want to input a file, I want to input the microphone sound.
I really wish I would have been paying attention in Unix class...
Something like `microphone | chunk 3s | whisper | stdout` would be SO COOL!!! I think that's possible, but I'm too lazy to look further.
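Something close is doable today. A sketch of the `chunk 3s` stage in Python, reading raw 16 kHz mono 16-bit PCM from stdin; the wiring comments at the bottom are assumptions about how you'd hook it up, not tested commands:

```python
import sys
import numpy as np

CHUNK_BYTES = 16000 * 2 * 3  # 3 s of 16 kHz mono 16-bit PCM

def chunks(stream):
    """Yield successive 3-second float32 audio buffers from a raw PCM stream."""
    while True:
        buf = stream.read(CHUNK_BYTES)
        if not buf:
            return
        # int16 PCM -> float32 in [-1, 1], the range Whisper expects
        yield np.frombuffer(buf, dtype=np.int16).astype(np.float32) / 32768.0

# Wiring it up would look roughly like (not run here):
#   arecord -f S16_LE -r 16000 -c 1 | python stream.py
# with stream.py doing:
#   import whisper
#   model = whisper.load_model("tiny")
#   for audio in chunks(sys.stdin.buffer):
#       print(model.transcribe(audio)["text"])
```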
Now I just want OCR that's even 50% as good as this...
"He's the bedroom cosmic rocker" (should be "He's the veteran cosmic rocker" in Veteran Cosmic Rocker by The Moody Blues)
I also noticed that it's a little on the conservative side for detecting speech; all songs were missing at least part of one line.
I am one of the top contributors to the tiny Mozilla Common Voice dataset for my language. The dataset is very small compared to those for other popular languages, and none of the other datasets mentioned contribute data in that language to Whisper's training.
And even with so little data to train on it still works surprisingly well.
(some NSFW words in the lyrics obv)
Just put it in a flake.nix, and "nix develop" followed by "virtualenv ./venv; . ./venv/bin/activate; pip install git+https://github.com/openai/whisper.git"
{
description = "Python 3.9 development environment";
outputs = { self, nixpkgs }:
let
system = "x86_64-linux";
pkgs = import nixpkgs { inherit system; };
in {
devShells.${system}.default = pkgs.mkShell {
buildInputs = [
pkgs.ffmpeg
pkgs.python39
pkgs.python39Packages.pip
pkgs.python39Packages.numpy
pkgs.python39Packages.pytorch
pkgs.python39Packages.virtualenv
];
};
};
}

[edit]
I confirmed CUDA worked with the "small" model, which used 3.3GB of GPU ram, and resulted in much poorer recognition than the "medium" model on my CPU (but it ran at least two orders of magnitude faster).
{
description = "Python 3.9 development environment";
outputs = { self, nixpkgs }:
let
system = "x86_64-linux";
pkgs = import nixpkgs {
inherit system;
config.allowUnfree = true;
config.cudaSupport = true;
};
in {
devShells.${system}.default = pkgs.mkShell {
buildInputs = with pkgs; [
cudatoolkit linuxPackages.nvidia_x11
cudaPackages.cudnn
libGLU libGL
xorg.libXi xorg.libXmu freeglut
xorg.libXext xorg.libX11 xorg.libXv xorg.libXrandr zlib
ncurses5 stdenv.cc binutils
ffmpeg
python39
python39Packages.pip
python39Packages.numpy
python39Packages.pytorch-bin
python39Packages.virtualenv
];
shellHook = ''
export LD_LIBRARY_PATH="${pkgs.linuxPackages.nvidia_x11}/lib"
'';
};
};
}

I want to build a tool that takes a video and generates subtitles for it, then I want to index the subtitles and let people search for a specific quote to scrub to that part of the video using automatically generated urls.
This is for a specific fandom of a ton of content, lots of dirty audio mostly recorded in a gym setting with multiple people speaking.
Have I been living under a rock, or is this new?
I assume it should help performance, because it means emphasis, timing and tone can be used to inform the translation. Helps make better guesses about information missing from the source language.
Sometimes it outputs the words "thank you" (which I did not say), sometimes it outputs a period. It never once output anything I said. It seems completely broken.
EDIT: apparently something about the combination of Safari+HF+Whisper was not working. I tried another Whisper demo on HF and had the same results. Switching to Chrome made it work flawlessly... I have no idea what kind of codec incompatibility was happening.
(irony)
https://news.ycombinator.com/item?id=32862172
MIT licensed model seems way better
Perhaps this development, along with continued optimization and increases in device compute power, will lead us into a near future where things like Mycroft devices and cellphones have local-only speech-to-text and translation capabilities that are accurate even with the environmental background noise encountered IRL.
Great work OpenAI team!
We've tested open source solutions for s2t, like Kaldi, but the quality was not good enough. However, one of the main advantages of a service like assembly.ai to me was that they offer sentence splitting in the form of punctuation, plus speaker detection, which Kaldi does not.
So I guess I answered my own question to some degree: an S2T service is more than just S2T. We already see assembly.ai add more and more features (like summarisation, PII redaction etc.) that are a value-add to plain S2T.
Still, curious to hear what your take on that is.
On a quick video transcription test, this model is more accurate than AssemblyAI and Rev AI. It will be harder for them to sell pure ASR now. Some more business-oriented applications will still be important though, for example ASR as part of a callcenter analytics solution or as part of a medical ERP system.
The value of automatic summarization is small; without AI it is very hard to get right, since you need to be an expert in the field to understand what is important.
Tested this out over the span of a few hours and got a solution up and running that downloads a video from YouTube, spits out the transcription, and uploads the resulting transcription file externally. We're still missing a piece to upload directly to YouTube, but it's a start!
As part of this experiment, we built some templates that let anyone play around with Whisper in our platform. If you're interested in seeing it, we made a video showing the process with our templates [2], and one doing it directly with Python [3].
Hope someone finds this useful!
[1] https://www.shipyardapp.com [2] https://www.youtube.com/watch?v=XGr4v3aY1e8 [3] https://www.youtube.com/watch?v=xfJpGgyUkvM
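For anyone who wants to reproduce the download-and-transcribe part of this pipeline, here is a minimal sketch that shells out to the yt-dlp and whisper CLIs. The URL, filenames, and model size are placeholders, and the external-upload step is omitted since it depends on the destination.

```python
import subprocess

def build_download_cmd(url, out="audio.%(ext)s"):
    # yt-dlp extracts the best audio track and converts it to mp3
    return ["yt-dlp", "-x", "--audio-format", "mp3", "-o", out, url]

def build_transcribe_cmd(audio, model="base"):
    # the whisper CLI writes .txt/.srt/.vtt transcripts next to the audio
    return ["whisper", audio, "--model", model]

def transcribe_video(url):
    subprocess.run(build_download_cmd(url), check=True)
    subprocess.run(build_transcribe_cmd("audio.mp3"), check=True)
```

The resulting transcript files could then be pushed wherever the last step of the pipeline needs them.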
I was originally using Adobe Premiere Pro's speech-to-text to do it, and wrote Python to convert its output to the Hyperaudio format on GitHub. With this, I can skip that whole step entirely, and it's fully open source, too.
App idea:
Build an app that takes a video and uses Hyperaudio or a similar project to add a clickable, searchable transcript (clicking in the transcript seeks the video).
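A minimal sketch of that idea, assuming Whisper-style segments (dicts with `start`, `end`, and `text` keys, which is the shape `model.transcribe()` returns): a small function renders the segments as an HTML page where clicking a transcript line seeks the video. Everything beyond the segment shape is illustrative.

```python
import html

def segments_to_html(segments, video_src="video.mp4"):
    """Render Whisper segments as a clickable transcript under a <video> tag."""
    lines = [f'<video id="v" src="{video_src}" controls></video>', "<div>"]
    for seg in segments:
        text = html.escape(seg["text"].strip())
        # Clicking a line seeks the video to the segment's start time.
        onclick = f"document.getElementById('v').currentTime={seg['start']}"
        lines.append(f'<p onclick="{onclick}">{text}</p>')
    lines.append("</div>")
    return "\n".join(lines)

# Example with one Whisper-style segment:
page = segments_to_html([{"start": 0.0, "end": 2.5, "text": " Hello world"}])
```

Searchability comes for free from the browser's find-in-page; a real app would also highlight the current segment by listening to the video's timeupdate event.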
A second run gave better results, but in most runs I do see instances where phrases repeat from 2-20 times.
I'm surprised by the quality on non-English languages, given that 80+% of the training data is English, and the rest is split between tens of languages.
It's sometimes close to perfect, and sometimes goes off the rails; I think the model tries to establish some sort of consistency within each sentence: if it starts wrong on the first few words, it can't build the rest properly.
But it's super fun.
https://clips.twitch.tv/ReliablePopularWerewolfOSkomodo-pcuw...
because... hard accent.
On the first run, Whisper thought it was Welsh, so I had to run it with --language en, and then it did pretty well.
https://i.imgur.com/TQiYU9X.png
It took 36 seconds in Google Colab.
However, it runs very slowly. It uses the CPU on my MacBook, presumably because it doesn't have an NVIDIA card.
Googling around, I found [plaidML](https://github.com/plaidml/plaidml), a project promising to run ML on many different GPU architectures. Does anyone know whether it's possible to plug the two together somehow? I'm not an ML researcher and don't understand the technical details of the domain, but I can understand and write Python code in domains I do understand, so I could do some glue work if required.
[1] https://www.youtube.com/watch?v=ywIyc8l1K1Q&ab_channel=1litt...
More troubling is a short audio clip that got a few full sentences back, several times the text length that comes back from the other models or Vosk. The content of the sentences is extremely far from the audio content. The best alignment I can find is that the first word of medium.en's interpretation is somewhat phonetically similar to the audio.
The small.en model doesn't show these behaviors, at least in this data set.
[00:00.000 --> 00:05.400] Gordy and County Kerry are investigating the theft of up to 60 sheep on Mount Brandon.
[00:05.400 --> 00:10.400] One of the farmers is offering a reward for information leading to the return of the use,
[00:10.400 --> 00:12.200] which are worth thousands of euro.
[00:12.200 --> 00:14.200] Well, I'm fine with that.
[00:14.200 --> 00:15.200] That's right.
[00:15.200 --> 00:16.200] Do you own them?
[00:16.200 --> 00:17.200] Anyone can say it.
[00:17.200 --> 00:18.200] Fine with that.
[00:18.200 --> 00:22.720] Last Saturday, Mikey Joe O'Shea brought his flock of Scotch sheep down from the mountain
[00:22.720 --> 00:25.320] commonage ahead of lambing.
[00:25.320 --> 00:29.840] He discovered over 50 were missing, allowing for a number of deaths and
[00:29.840 --> 00:30.840] strays.
[00:30.840 --> 00:34.600] Mikey is convinced over 45 sheep have been stolen.
[00:34.600 --> 00:35.600] It was a good night.
[00:35.600 --> 00:36.600] It would be a full moon there.
[00:36.600 --> 00:37.600] It would be a good night.
[00:37.600 --> 00:38.600] It would be bright out.
[00:38.600 --> 00:40.600] There could be anyone going up in the mountains.
[00:40.600 --> 00:41.600] It would be a good night.
[00:41.600 --> 00:43.600] Well, that was 45 sheep missing.
[00:43.600 --> 00:49.600] Mikey and the lambs and everything in the sheep, they counted out a nice bit of money.
[00:49.600 --> 00:52.200] They've been doing the boat in Nassan.
[00:52.200 --> 00:53.200] It's a big one. [00:53.200 --> 00:54.200] It's a big one. [00:54.200 --> 00:55.200] It's a big one.
[00:55.200 --> 00:59.000] Mikey's next door neighbor says some of his sheep have also been stolen.
[00:59.000 --> 01:00.000] Come back. [01:00.000 --> 01:01.000] Come back. [01:01.000 --> 01:02.000] Come back.
[01:02.000 --> 01:03.000] I've been missing about 10 years.
[01:03.000 --> 01:04.000] It's not all that difficult.
[01:04.000 --> 01:06.320] All they've got to do is have a good dog.
[01:06.320 --> 01:10.560] Have a good dog and go at night, some moonshine night.
[01:10.560 --> 01:11.560] Just put the dog around him.
[01:11.560 --> 01:14.120] Put him on a trailer and walk him.
[01:14.120 --> 01:18.360] And then probably somebody else to pick him up.
[01:18.360 --> 01:29.960] Everybody's doing it north, but he's doing it.
Second, is there a bug with how the script processes incoming audio segments? For a short 4 second clip, what I got was:
> [00:00.000 --> 00:03.760] Okay, Eunice, travel plans. I need to be in New York on Monday, L.A. on Tuesday, New York on Wednesday, L.A. on Thursday. You're knocking Friday. Got it?
> [00:03.760 --> 00:28.760] Got it.
However, the final segment should have been just shy of 1 second. It mistakenly thinks the last segment was 25 seconds long, and makes you wait for the processing.
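A small sanity check can flag segments like this one, whose end timestamp runs past the actual clip length; the 0.5 s tolerance and the example durations below are just illustrative.

```python
def suspicious_segments(segments, audio_duration):
    """Flag segments whose timestamps run past the actual audio length."""
    flagged = []
    for seg in segments:
        if seg["end"] > audio_duration + 0.5:  # small tolerance for rounding
            flagged.append(seg)
    return flagged

segs = [
    {"start": 0.0, "end": 3.76, "text": "Okay, Eunice, travel plans..."},
    {"start": 3.76, "end": 28.76, "text": "Got it."},
]
# For a roughly 4-second clip, the second segment is clearly misreported.
bad = suspicious_segments(segs, 4.0)
```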
for so many reasons.
But one thing that really pisses me off is not being able to turn it off on the iPhone, and the fact that, aside from "hidden cameras in my Airbnb", soon we will have to worry about secret listening machines EVERYWHERE.
Of course, the ability to scale this more cheaply (throwing more compute at it, instead of more people) is somewhat scary, but it's not really introducing a new capability. Especially since you still have to do something with the transcript. An AirBnB landlord who reads the transcript of what you said could as well have listened to the recording.
Anyway, it's out there now. No way to turn back.
I don’t believe OpenAI has anyone presenting at the conference, so presumably this was timed to coincide with that and get buzz at the conference.
Curious how this model compares with the FOSS STT from the startup Coqui.
It seems to describe the project better for a technical audience.
Anecdotally, I feel like there are plenty of times that I need context from more than 30 seconds ago to understand some technical jargon that's being discussed.
International Phonetic Alphabet (IPA)
- https://wikipedia.org/wiki/International_Phonetic_Alphabet
_________
EDIT: Based on the list of languages in the tokenizer code here, IPA doesn't appear to be supported:
https://github.com/openai/whisper/blob/5f8d4bcc254d4f3e833d3...
>>> result = whisper.decode(model, mel, options)
Traceback (most recent call last):
[snip]
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
It looks like a Torch error, is there some twiddling with "options" I can do to get it to run?
>>> options = whisper.DecodingOptions(fp16=False)
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
And it works.

Right now I decline all speech recognition because I don't want Orwellian listening devices in my house or pocket, and I haven't seen an answer. (Also, I haven't been bothered enough about speech-command interfaces to do a load of research; lazy me.)
Unfortunately my system is not ideal for today's AI tools: Whisper runs only on the CPU, and it's slow.
I know PyTorch recently added Metal support, but only for M-based Macs. Has anyone found a way to make it work with Intel Macs?
The description makes it sound like it is a model for transcribing English audio.
> We’ve trained and are open-sourcing a neural net called Whisper that approaches human level robustness and accuracy on English speech recognition.
I guess I will need to download and run it to see how correct that is.
> * recording
> * done recording
> Recording saved to file.wav
> Press enter to transcribe
> /Users/laptop/Development/Personal/Public/pythonProject1/venv/lib/python3.9/site-packages/whisper/transcribe.py:70: UserWarning: FP16 is not supported on CPU; using FP32 instead
>   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
> Detected language: english
> Goodbye, I need to go pick up my wife.
> Press enter to start recording
Any improvements welcome here.
```
import wave

import pyaudio
import whisper


def record_microphone(seconds):
    """Record from the default microphone and save to file.wav."""
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 44100
    WAVE_OUTPUT_FILENAME = "file.wav"

    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)

    print("* recording")
    frames = []
    for _ in range(int(RATE / CHUNK * seconds)):
        frames.append(stream.read(CHUNK))
    print("* done recording")

    stream.stop_stream()
    stream.close()
    p.terminate()

    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()
    return WAVE_OUTPUT_FILENAME


if __name__ == '__main__':
    model = whisper.load_model("base")  # load once, outside the loop
    seconds = 5
    while True:
        print("Press enter to start recording")
        input()
        filename = record_microphone(seconds)
        print("Recording saved to " + filename)
        print("Press enter to transcribe")
        input()
        result = model.transcribe(filename)
        print(result["text"])
```

For reference, GCP's Speech-to-Text didn't detect any speech from this clip -- even when using the enhanced phone model.
I wonder if this will change.
The "base" model (supposedly 16x faster than the large one) takes longer than the audio file's playback time on my machine to do a transcription.
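One way to put a number on "slower than playback" is the real-time factor, processing time divided by audio duration; anything above 1.0 means the model can't keep up with live audio. The timings below are made up for illustration.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF > 1.0 means transcription is slower than just playing the audio."""
    return processing_seconds / audio_seconds

# e.g. a 60 s clip that took 90 s to transcribe on CPU:
rtf = real_time_factor(90.0, 60.0)  # 1.5, i.e. slower than real time
```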
But also, this tool seems much better than Otter.ai, which gets every third word wrong when transcribing microbiology recordings.
I keep getting `ModuleNotFoundError: No module named 'setuptools.command.build'`
"[01:17.000 --> 01:32.000] Translated by Releska" when using the translate to english. That entire part of the song is instrumental. This line does not appear at all in the original transcribe only in the opus format rip.
It shows up in the YouTube rip in format 251 (Opus), but not in format 140 (AAC from YouTube), nor in the FLAC rip. All three give different results.
The translation quality seems tied to the bitrate: the same song converts to different words, the only difference being bitrate and format. Converting my own rip with the same parameters as YouTube (Opus @140 and then @130) didn't let me reproduce the error.
The model hung for a solid extra minute at the end when translating to English: the last ~90 seconds of the song took 60 seconds of real time, while the entire rest took about 90. The same behavior was not observed with transcribe.
Some of the English words are incorrect, but that was expected. The first Japanese "mistake" I found was "全ては二人の" instead of "すべては ふたりの" (the former being what Whisper wrote). A single random word, "hey", was transcribed/translated into English even though it's just the singer elongating the 園 while singing 楽園: "落ちてゆく 二人で繋がれた二人のラグ HEY" instead of "落ちていく 鎖でつながれた 二人の楽園".
I am using the official subtitles released on the youtube video.
It's a complex Japanese song with both Japanese and English. The original transcription took about 20 real-time seconds to produce the first line and 130 seconds for the whole song. It seems to show results in 20-second window increments, but this appears to depend on what it considers audio and what it throws away.
On my computer I wasn't able to use the large model because I ran out of VRAM (I have 8 GB; not sure how much more it'd require), so I ran it with medium.
The song is False Sympathy by Mondo Grosso. The MV is suggestive, in case that matters. I grabbed a fresh audio rip from YouTube because I didn't want to take the CD out of its case.
https://www.youtube.com/watch?v=B6Y-WsgpzlQ
It translates this version differently from the director's cut version; I ripped both as Opus.
There is something weird about how it handles the Opus-encoded version: I find the same "Translated by Releska" in a WAV version transcoded from the Opus.
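To check whether the codec/bitrate alone changes the output, one can re-encode the same source at a few bitrates with ffmpeg and compare the resulting transcripts. This is only a sketch: the filenames and bitrates are placeholders, and diffing whisper's outputs is left as a manual step.

```python
import subprocess

def transcode_cmd(src, dst, bitrate):
    # Re-encode the same audio at a given bitrate (e.g. "130k", "140k").
    return ["ffmpeg", "-y", "-i", src, "-b:a", bitrate, dst]

def make_test_files(src, bitrates=("130k", "140k")):
    """Produce one re-encoded copy per bitrate for an A/B transcription test."""
    outputs = []
    for br in bitrates:
        dst = f"test_{br}.opus"
        subprocess.run(transcode_cmd(src, dst, br), check=True)
        outputs.append(dst)
    # Then run e.g. `whisper test_130k.opus --task translate` on each and diff.
    return outputs
```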
Result of my own recording:
Detected language: georgian
ᔨᴉᴉ�ちゃんᓁᔇ � remnants ᡔ� founding ហ�ockey� slee សᕁ �eling ភᕩ�icularly អᕖᕤ�APPLAUSEPS ថ�Dav頻道 ប�DING� Możai បፘ្ទក ុក ឵� orchestral ុក ឵� arter ូ� Brettំ �
hilarious ល ឬ ᔼ� vårក បក ្៙ � Poll statements ឭ᪨្pson. ჩჩრუესიმეისლემვეერრშუეაირელმირისასასსსესსერერსივეესრრილმეხრე რეიმიმეფემსესე�
Results of clear Georgian audio [1].

On the tiny model:
Detected language: georgian
[00:00.000 --> 00:21.560] én
[00:21.560 --> 00:23.240] 我伦伦…
[00:23.280 --> 00:43.720] 我伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦因为b forestry
On the medium model:

Detected language: georgian
სრჱირესრრრრრრრრრრრრრნსსსრრრრრეე რრირრრრრრრრრე რსრნგნრრრრსრრრრრრრორრრრრრრრრრრ� ḵḸḇḤḾḤḾḤḾḤḾḤḾḤḾḤḾḤḾḾḤḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾ� ḥḾḼḥḾ
ḥḾḾ ḥḾḾ ḤḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾ� ḲḵḽḻḽḾ Ḫḵḽḻḽ so� ḻḽḽ ḻḽḻḻḽ ḱᴇ᷻ᵒ ḳᶟᄤḱ ḯᵁ Ḳᴄᴍᴆ Ḧᴍ� Ḧᵒ ḳᴍᴇ ḽᴄᴍᴛᴄ Ḧᴇᴆ ḳᵗᴇ ḽḮᴆ Ḫᴇᴾ ḿᴏᴇᴄᴄᴏ
ច�izar� wait �ห� examined ᑇទមះៈេំ supervision ង� იეეეეეეეეეეეეეეეეე მაეე ეაეეეეეეეეეეეეეეეეეეეე დაეეეეეეეეეეეეე უეეეეეეეეეეეეე ეა� მიი სმეიი მმიეი Ⴢქ სიიეი
სავიე სიიითთიიმემი, რაეე სიიმე სიიი ღიიიიწეირი საეიეიი სიიეი სი� ვეეფვეიიიე ქლეეშეეროეეეეეეეეეეეეე. ეგეზ ეყაკშეიეეეეეეეეეეეეეეეეეეეეეეეეეეეეეა, ნრროპიროო მმუმინ
სეეკნფეე სეეჍიგოშ სჟებიმელელეეკირპიე სემეიმე სეეიმმმ სეენემეეი სე� ᑦ� Famose m인데요 hqe bywall jaini threshold ji jani den poder vlogging bywall Take the text Ba
tou yodamj je te shake ba te shake baou contour but whatever Baou cube baou cup Baou rope Baou people Qeful Qeful იმიიიმიბთმითიიითიიიიიიიი
რაოეოოოენპეეეიეიიიიიიიიიომიიიიიიიიი რიიიიიიიიიიიმიი� ნსეეეეეეეეეეეეეეე სარეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეე� მጇივ ეეეიდჼვვ ნაბდადებ
ლმირეეეეფედუივევეეეიიეეეეე რარეიეეეევეეეეევეე სარრეეეეეეეეეეეეეეეეეეეეეეეეეეე ხშიიიიიიიიიიიიი ლიიიიიიი ლიიიიიიიიიი ლიიი ლიიიიიიი ლაიიიიი ეიიიიიიიიიიიიიიი იიიი მ�
I've also tested it on a few other audio inputs, and it failed to produce meaningful results on all of them, with all models. There was one case, with another audio clip [2] and the tiny model, where it got at least some words close to their phonetic values, but printed them in Cyrillic instead of Georgian and tried to interpret some Georgian words as Russian:
whisper audio.wav --language Georgian --task transcribe --model tiny
[00:00.000 --> 00:02.000] «Зураб Герча Джапарзис Ганц Хатеваром
[00:02.000 --> 00:04.000] умерен цупасу Хизгеблоту кащепаста
[00:04.000 --> 00:06.000] а опозационермии член шонахлари
[00:06.000 --> 00:07.000] с дрородисат Сакартолом
[00:07.000 --> 00:09.000] с акутаритеритория бюнда дай бронос
[00:09.000 --> 00:10.000] та тасовый торуси сам кадр
[00:10.000 --> 00:12.000] Сакартоломший ровно украйенисту
[00:12.000 --> 00:13.000] щойго екнебо
[00:13.000 --> 00:14.000] амсясахеб кирчи метитаусу
[00:14.000 --> 00:15.000] хлебислидерма
[00:15.000 --> 00:17.000] уцноктангадацема щейсяа уградунца
...
[1] https://www.youtube.com/watch?v=rE_zx_6RhL0
[2] https://www.youtube.com/watch?v=elrXgO8hjtI

[00:00.000 --> 00:10.000] पचास ताल में हमने प्रगती किये, इससे को इंटार नहीं कर सकता।
[00:10.000 --> 00:20.000] छुनाओ के दौरान वोट मांगते हुए, सरकार की नीतियों पर कठोर से कठोर प्रहार करते हुए,
[00:20.000 --> 00:28.000] और पुरानी सरकार की नीतियों नहीं आलोचना करने के लिए लैक बहुत सामग्री थी।
[00:28.000 --> 00:35.000] हर जगे मैंने ये कहा कि मैं उन लोगों में से नहीं हूँ, जो पचास वर्च की उपलड्यों पर पानी फिर दे।
[00:35.000 --> 00:43.000] ऐसा करना देश के पुर्षार्थ पर पानी फिरना होगा। ऐसा करना देश के किसान के साथ अन्याय करना होगा।
[00:43.000 --> 01:01.000] मल्दूर के साथ जात्ती करनी होगा। आम आद्मी के साथ भी वो अच्छा व्योहार नहीं होगा। जो स्वाल आज मन में उच्छा है और उच्छना चाही है। आदावी को पचास साथ होने आये, हम जैनती मनाने जा रहे हैं।
[01:01.000 --> 01:18.000] आज देश की स्तिती क्या है। हम पिछर के होगे हैं। प्रगती की दोड़ में, जो देश हमारे साथ आजाद हुए थे, वो हम से आगे बढ़ के। जो देश हमारे बाच जन में थे, वो हमें पीचे छोड़ थे।
[01:18.000 --> 01:34.000] दुनिया के गरी तम देशों में हमारी गड़न आये। वीस फीज़ी से जाना लो गरीबी की रेका के नीचे। राक्तपती महुदाय के विभाशन में गाऊं का उल्लेक हैं ना पीरे का पानी नहीं।
[01:34.000 --> 01:50.000] हम प्राथमी शिक्षा अनिवारे नहीं कर सकते हैं। लड्कियों की शिक्षा की उपेक्षा हो रही हैं। लड्कि का जन्म लेना तो इस देश में अभी तक एक अभिशाप है।
[01:50.000 --> 02:07.000] क्या सरकारी कदम उठाकर समाज में जाग्दृती पैदा करकें। क्या सब लोगों को जुटाकर ये तो ऐसा काम है जिस में कोई दलबंदी के लिए इस्थान नहीं। हम देश का नक्षा नहीं बदल सकते हैं। देश में साधनों की कमी नहीं है।
[02:07.000 --> 02:07.000] और साधनों की अगर कमी है तो उसको ठीक दन्त से प्राप्त किया जा सकता है। साधन बड़ाए भी जा सकते है। लेकिन जो साधन हैं उनका ठीक उपयोग नहीं हो रहा। जंता के उपर टेक्स लगाकर जो दन्नि कप्ता किया जाता है। उसका लाग जंता तक नहीं पहु
[02:37.000 --> 02:37.000] रख्कम जाती है। विदेशी बैंको में दन जाने का सिल्सिला अभी तक क्यों काएं है। उसको लोकने के लिए क्या कदम उठाएगे। हम विदेशी पूजी के लिए प्रैत्रशील हैं विदेशी पूजी आए और अगर विदेशी पूजी आती है अच्छे दन्त की टेक
[03:07.000 --> 03:07.000] अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
[03:37.000 --> 03:39.000] अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
[04:07.000 --> 04:09.000] अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
[04:37.000 --> 04:39.000] अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
The translation does a much better job, however:

[00:00.000 --> 00:10.000] In the last 50 years, we have made progress, no one can deny this.
[00:10.000 --> 00:20.000] During the elections, while asking for votes, while attacking the government's policies harshly,
[00:20.000 --> 00:28.000] and to criticize the policies of the old government, a lot of material was needed.
[00:28.000 --> 00:35.000] Everywhere, I have said that I am not one of those people who pour water on the fruits of 50 years.
[00:35.000 --> 00:39.000] To do this, we will have to pour water on the efforts of the country.
[00:39.000 --> 00:43.000] To do this, we will have to do injustice with the farmers of the country.
[00:43.000 --> 00:45.000] We will have to do caste with the laborers.
[00:45.000 --> 00:50.000] Even with the common man, that will not be a good behavior.
[00:50.000 --> 00:55.000] The question that arises in the mind today and should arise,
[00:55.000 --> 01:01.000] Freedom has come to be 50 years, we are going to celebrate.
[01:01.000 --> 01:04.000] What is the situation of the country today?
[01:04.000 --> 01:07.000] Why did we get separated?
[01:07.000 --> 01:14.000] In the race of progress, the country that got freedom along with us, they went ahead of us.
[01:14.000 --> 01:19.000] The country that was after us, they left us behind.
[01:19.000 --> 01:25.000] In the poorest countries of the world, they counted us.
[01:25.000 --> 01:29.000] 20% of the population is below the poverty line.
[01:29.000 --> 01:35.000] In the speech of the President, there is no mention of villages or drinking water.
[01:35.000 --> 01:39.000] We cannot enforce primary education.
[01:39.000 --> 01:43.000] The education of girls is being neglected.
[01:43.000 --> 01:50.000] The birth of a girl is still a curse in this country.
[01:50.000 --> 01:55.000] Is it by taking government steps, by creating awareness in the society?
[01:55.000 --> 02:01.000] Is it by uniting all the people that there is no place for party?
[02:01.000 --> 02:05.000] Can't we change the map of the country?
[02:05.000 --> 02:08.000] There is no shortage of resources in the country.
[02:08.000 --> 02:14.000] And if there is a shortage of resources, it can be obtained in the right way, resources can be increased.
[02:14.000 --> 02:21.000] But the resources that are there, they are not being used properly.
[02:21.000 --> 02:30.000] The wealth that is collected by taxing the public, its profit does not reach the public, it does not reach the common man.
[02:30.000 --> 02:32.000] Where does it go?
[02:32.000 --> 02:35.000] Whose pockets are filled?
[02:35.000 --> 02:39.000] Whose treasury does that money go to?
[02:39.000 --> 02:44.000] Why is the chain of money going to foreign banks still established?
[02:44.000 --> 02:47.000] What steps have been taken to stop it?
[02:47.000 --> 02:52.000] We are motivated for foreign worship, foreign worship has come.
[02:52.000 --> 03:01.000] And if foreign worship comes for good technology, for infrastructure,
[03:01.000 --> 03:06.000] for education, then no one will object.
[03:06.000 --> 03:11.000] I believe that our communist friends will not object either.
[03:11.000 --> 03:19.000] But is the maximum use of the resources in the country happening?
[03:19.000 --> 03:26.000] Is it not true that corruption has become a national disease?
[03:26.000 --> 03:31.000] I remember that Swargi Rajiv Gandhi had said in a speech that I send one rupee from Delhi,
[03:31.000 --> 03:36.000] but where I send the rupee, as I reach there, 19 paise are left.
[03:36.000 --> 03:41.000] I asked him how this miracle happens.
[03:41.000 --> 03:47.000] Bhaskar said that when the rupee runs, it shrinks.
[03:47.000 --> 03:54.000] The rupee shrinks, it gets into the hand, it goes into the pocket, it becomes small.
[03:54.000 --> 03:58.000] It is difficult to recognize the rupee.
[03:58.000 --> 04:02.000] The rupee can be hidden.
[04:02.000 --> 04:06.000] The situation of the currency of the country is not good.
[04:06.000 --> 04:10.000] First, the government expenditure has increased, it is increasing.
[04:10.000 --> 04:17.000] It needs common consent to reduce without reducing.
[04:17.000 --> 04:24.000] No one can work in the same way.
[04:24.000 --> 04:27.000] Yes, our old Prime Minister Narasimha Raoji,
[04:27.000 --> 04:34.000] if he would have tried in this direction after stabilizing himself, then he would have succeeded.
[04:34.000 --> 04:47.000] But he was stuck in some such things that he could not pay attention to these problems.