Basically whisper.cpp has some support, but it's not great (based on my own testing):
- https://huggingface.co/spaces/vumichien/whisper-speaker-diar...
- pyannote diarization: https://github.com/Majdoddin/nlp
- whisperX with diarization https://twitter.com/maxhbain/status/1619698716914622466 https://github.com/m-bain/whisperX
https://github.com/ggerganov/whisper.cpp
It uses 8x less memory than the Python implementation for the tiny model. It would be a good idea to keep an eye on it, since Python bindings are planned on the roadmap.
https://github.com/kardianos/audioclerk
Built in Go/cgo.
Am I correct on this? The README is not explicit.
It automates video fetching and uses Whisper to generate .srt, .vtt, and .txt files.
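Generating the .srt is mostly a formatting exercise once you have timed segments. A minimal sketch in Python (not how audioclerk's Go code actually does it; this just assumes the segment shape the Python whisper package returns, a list of dicts with `start`, `end`, and `text`):

```python
def fmt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render [{'start': ..., 'end': ..., 'text': ...}] as SRT blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{fmt_timestamp(seg['start'])} --> {fmt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

print(segments_to_srt([
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 5.0, "text": " General Kenobi."},
]))
```

The .vtt format is nearly the same (a `WEBVTT` header and `.` instead of `,` in timestamps), and .txt is just the concatenated text.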
You could do better by overlapping the segments, except then stitching the transcriptions together becomes an issue since whisper doesn't provide reliable per-token timestamps [0], and the output of the common part of overlapping segments isn't necessarily the same. I can imagine a cool approach where you transcribe long, overlapping chunks in real-time and intelligently merge the stream of words somehow though.
Some more useful discussion here (whisper.cpp project, but still relevant) [1].
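At the word level, the "intelligently merge" step could start as simply as finding the longest run of words the tail of one chunk and the head of the next agree on. A rough sketch (exact matching only; it does nothing clever when the two decodes disagree on the overlap, which is exactly the hard case described above):

```python
def merge_overlapping(a: list[str], b: list[str]) -> list[str]:
    """Stitch two word lists from overlapping audio chunks:
    find the longest suffix of `a` that equals a prefix of `b`
    and emit that shared region only once."""
    best = 0
    for k in range(1, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            best = k
    return a + b[best:]

left = "the quick brown fox jumps over".split()
right = "fox jumps over the lazy dog".split()
# the shared "fox jumps over" is detected and deduplicated
print(" ".join(merge_overlapping(left, right)))
```

A more robust version would use fuzzy matching (edit distance over words) for the overlap, since the two transcriptions of the common audio often differ slightly.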
Use case: I want to transcribe my poker hands while playing, e.g. "Flop was 2 of spades, 3 of diamonds and King of spades", "Button raised to $20", etc.
When I tried using Whisper and some other model, the recognition accuracy was atrocious, and it kept finding non-poker words that sounded similar to poker words. I want to restrict its search space to my own list of poker words which should significantly increase the accuracy (theoretically).
Any suggestions on how to go about this?
Whisper's source is very readable; check out https://github.com/openai/whisper/blob/main/whisper/decoding...
You can restrict the vocabulary the way you like; see, for example, the chess app built with Vosk:
https://alphacephei.com/vosk/lm
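Concretely, Vosk's recognizer accepts a grammar as a JSON-encoded list of words/phrases, which constrains decoding to that vocabulary. A sketch for the poker case (the word list is made up for illustration; running the commented part needs `pip install vosk`, a downloaded model, and 16 kHz mono PCM audio):

```python
import json

# Hypothetical poker vocabulary to restrict recognition to.
POKER_WORDS = [
    "flop", "turn", "river", "button", "raised", "called", "folded",
    "two", "three", "king", "of", "spades", "diamonds", "hearts", "clubs",
    "twenty", "dollars", "[unk]",  # [unk] absorbs out-of-vocabulary speech
]

def poker_grammar() -> str:
    """Vosk takes the grammar as a JSON array serialized to a string."""
    return json.dumps(POKER_WORDS)

# Usage sketch (not run here, since it needs a model download):
# from vosk import Model, KaldiRecognizer
# rec = KaldiRecognizer(Model("path/to/vosk-model"), 16000, poker_grammar())
# rec.AcceptWaveform(pcm_bytes)  # feed 16-bit mono PCM chunks
# print(json.loads(rec.Result())["text"])

print(poker_grammar())
```

Including `[unk]` matters: without it the recognizer is forced to map every sound onto your word list, which reintroduces the "similar-sounding word" problem for table talk.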
I've been looking at using compilation in PyTorch (torch.compile), but haven't had success yet; without it, inference can take a while to run. https://pytorch.org/tutorials/intermediate/torch_compile_tut...
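For reference, the basic torch.compile usage from that tutorial is a one-liner; whether it actually speeds up Whisper's autoregressive decode loop is a separate question. A toy sketch (the function here is a stand-in, not a real model):

```python
import torch

def forward(x):
    # toy stand-in for a model's forward pass
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

# torch.compile returns a wrapped callable; the actual compilation is
# deferred to the first call, which is why the first step is slow.
compiled_forward = torch.compile(forward)
print(callable(compiled_forward))
```

The first call, e.g. `compiled_forward(torch.randn(16))`, triggers tracing and codegen; subsequent calls with the same shapes reuse the compiled graph.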
I'd assume Whisper will be better than YouTube's auto-generated captions for sure, especially if you choose the right model.
On one source I tried the other day, the first 90 seconds or so are just generic opening music, no speech, but it "transcribes" it as "This is the end of the video. Thank you for watching. Please subscribe to the channel if you like. See you in the next video. Thank you for watching. Please subscribe to the channel if you like. Thank you for watching. ..." If you help it along by cutting the source into only the spoken segments, you can get it to do better, but just throwing it at a directory of material will probably leave you with some disappointment.
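The "cut it up into spoken segments" step can be approximated with a crude energy gate before handing chunks to Whisper. A sketch of the idea (note that a plain energy threshold will not reject music; a real VAD such as Silero or webrtcvad is needed for that):

```python
import numpy as np

def active_segments(samples: np.ndarray, sr: int,
                    frame_ms: int = 30, threshold: float = 0.02):
    """Return (start_sec, end_sec) spans whose frame RMS exceeds `threshold`.
    Crude energy gate: music still passes, so treat this as a placeholder
    for a proper VAD when filtering intros."""
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    spans, start = [], None
    for i, loud in enumerate(rms > threshold):
        if loud and start is None:
            start = i
        elif not loud and start is not None:
            spans.append((start * frame / sr, i * frame / sr))
            start = None
    if start is not None:
        spans.append((start * frame / sr, n * frame / sr))
    return spans

# Synthetic check: 1 s silence, 1 s 440 Hz tone, 1 s silence at 16 kHz.
sr = 16000
sig = np.concatenate([np.zeros(sr),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr),
                      np.zeros(sr)])
print(active_segments(sig, sr))  # roughly one span near (1.0, 2.0)
```

Each returned span can then be sliced out of the audio and transcribed on its own, which also gives you trustworthy segment-level timestamps for free.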
Then sometimes it does something surprising: on a J-pop song, after hallucinating a bit during the intro, it spat out a translation in the form you might find on a lyrics site, i.e. each line was "japanese-characters romaji-version english-translation". I haven't been able to get it to do that again (even for the same source).
For this model specifically (https://github.com/openai/whisper) it would be a significant challenge for a newcomer. Luckily Huggingface has a blog post that will get you started: https://huggingface.co/blog/fine-tune-whisper