Basically whisper.cpp has some support, but it's not great (based on my own testing):
- https://huggingface.co/spaces/vumichien/whisper-speaker-diar...
- pyannote diarization: https://github.com/Majdoddin/nlp
- whisperX with diarization https://twitter.com/maxhbain/status/1619698716914622466 https://github.com/m-bain/whisperX
https://github.com/ggerganov/whisper.cpp
It uses 8x less memory than the Python implementation for the tiny model. It would be a good idea to keep an eye on it, since Python bindings are planned on the roadmap.
https://github.com/kardianos/audioclerk
Built in Go/cgo.
Am I correct on this? The README is not explicit.
It automates video fetching and uses Whisper to generate .srt, .vtt, and .txt files.
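Generating the .srt is mostly a formatting exercise once you have timed segments. A minimal sketch in Python (not how audioclerk's Go code actually does it; this just assumes the segment shape the Python whisper package returns, a list of dicts with `start`, `end`, and `text`):

```python
def fmt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render [{'start': ..., 'end': ..., 'text': ...}] as SRT blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{fmt_timestamp(seg['start'])} --> {fmt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

print(segments_to_srt([
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 5.0, "text": " General Kenobi."},
]))
```

The .vtt format is nearly the same (a `WEBVTT` header and `.` instead of `,` in timestamps), and .txt is just the concatenated text.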
You could do better by overlapping the segments, except then stitching the transcriptions together becomes an issue since whisper doesn't provide reliable per-token timestamps [0], and the output of the common part of overlapping segments isn't necessarily the same. I can imagine a cool approach where you transcribe long, overlapping chunks in real-time and intelligently merge the stream of words somehow though.
Some more useful discussion here (whisper.cpp project, but still relevant) [1].
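At the word level, the "intelligently merge" step could start as simply as finding the longest run of words the tail of one chunk and the head of the next agree on. A rough sketch (exact matching only; it does nothing clever when the two decodes disagree on the overlap, which is exactly the hard case described above):

```python
def merge_overlapping(a: list[str], b: list[str]) -> list[str]:
    """Stitch two word lists from overlapping audio chunks:
    find the longest suffix of `a` that equals a prefix of `b`
    and emit that shared region only once."""
    best = 0
    for k in range(1, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            best = k
    return a + b[best:]

left = "the quick brown fox jumps over".split()
right = "fox jumps over the lazy dog".split()
# the shared "fox jumps over" is detected and deduplicated
print(" ".join(merge_overlapping(left, right)))
```

A more robust version would use fuzzy matching (edit distance over words) for the overlap, since the two transcriptions of the common audio often differ slightly.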
Use case: I want to transcribe my poker hands while playing, e.g. "Flop was 2 of spades, 3 of diamonds and King of spades", "Button raised to $20", etc.
When I tried using Whisper and some other model, the recognition accuracy was atrocious, and it kept finding non-poker words that sounded similar to poker words. I want to restrict its search space to my own list of poker words which should significantly increase the accuracy (theoretically).
Any suggestions on how to go about this?
Whisper's source is very readable; check out https://github.com/openai/whisper/blob/main/whisper/decoding...
You can restrict the vocabulary the way you like; see, for example, the chess app built with Vosk:
https://alphacephei.com/vosk/lm
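Concretely, Vosk's recognizer accepts a grammar as a JSON-encoded list of words/phrases, which constrains decoding to that vocabulary. A sketch for the poker case (the word list is made up for illustration; running the commented part needs `pip install vosk`, a downloaded model, and 16 kHz mono PCM audio):

```python
import json

# Hypothetical poker vocabulary to restrict recognition to.
POKER_WORDS = [
    "flop", "turn", "river", "button", "raised", "called", "folded",
    "two", "three", "king", "of", "spades", "diamonds", "hearts", "clubs",
    "twenty", "dollars", "[unk]",  # [unk] absorbs out-of-vocabulary speech
]

def poker_grammar() -> str:
    """Vosk takes the grammar as a JSON array serialized to a string."""
    return json.dumps(POKER_WORDS)

# Usage sketch (not run here, since it needs a model download):
# from vosk import Model, KaldiRecognizer
# rec = KaldiRecognizer(Model("path/to/vosk-model"), 16000, poker_grammar())
# rec.AcceptWaveform(pcm_bytes)  # feed 16-bit mono PCM chunks
# print(json.loads(rec.Result())["text"])

print(poker_grammar())
```

Including `[unk]` matters: without it the recognizer is forced to map every sound onto your word list, which reintroduces the "similar-sounding word" problem for table talk.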
I've been looking at using compilation in PyTorch (torch.compile), but haven't had success yet; without it, inference can take a while to run. https://pytorch.org/tutorials/intermediate/torch_compile_tut...
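For reference, the basic torch.compile usage from that tutorial is a one-liner; whether it actually speeds up Whisper's autoregressive decode loop is a separate question. A toy sketch (the function here is a stand-in, not a real model):

```python
import torch

def forward(x):
    # toy stand-in for a model's forward pass
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

# torch.compile returns a wrapped callable; the actual compilation is
# deferred to the first call, which is why the first step is slow.
compiled_forward = torch.compile(forward)
print(callable(compiled_forward))
```

The first call, e.g. `compiled_forward(torch.randn(16))`, triggers tracing and codegen; subsequent calls with the same shapes reuse the compiled graph.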
I'd assume Whisper will be better than YouTube's auto-generated captions for sure, especially if you choose the right model.
On one source I tried the other day, the first 90 seconds or so are just generic opening music, no speech, but it "transcribes" it as "This is the end of the video. Thank you for watching. Please subscribe to the channel if you like. See you in the next video. Thank you for watching. Please subscribe to the channel if you like. Thank you for watching. ..." If you help it along by cutting the source into only the spoken segments, you can get it to do better, but just throwing it at a directory of material will probably leave you with some disappointment.
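The "cut it up into spoken segments" step can be approximated with a crude energy gate before handing chunks to Whisper. A sketch of the idea (note that a plain energy threshold will not reject music; a real VAD such as Silero or webrtcvad is needed for that):

```python
import numpy as np

def active_segments(samples: np.ndarray, sr: int,
                    frame_ms: int = 30, threshold: float = 0.02):
    """Return (start_sec, end_sec) spans whose frame RMS exceeds `threshold`.
    Crude energy gate: music still passes, so treat this as a placeholder
    for a proper VAD when filtering intros."""
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    spans, start = [], None
    for i, loud in enumerate(rms > threshold):
        if loud and start is None:
            start = i
        elif not loud and start is not None:
            spans.append((start * frame / sr, i * frame / sr))
            start = None
    if start is not None:
        spans.append((start * frame / sr, n * frame / sr))
    return spans

# Synthetic check: 1 s silence, 1 s 440 Hz tone, 1 s silence at 16 kHz.
sr = 16000
sig = np.concatenate([np.zeros(sr),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr),
                      np.zeros(sr)])
print(active_segments(sig, sr))  # roughly one span near (1.0, 2.0)
```

Each returned span can then be sliced out of the audio and transcribed on its own, which also gives you trustworthy segment-level timestamps for free.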
Then sometimes it does something surprising: on a J-pop song, after hallucinating a bit during the intro, it spat out a translation in the form you might find on a lyrics site, i.e. each line was "japanese-characters romaji-version english-translation". I haven't been able to get it to do that again (even for the same source).
For this model specifically (https://github.com/openai/whisper) it would be a significant challenge for a newcomer. Luckily Huggingface has a blog post that will get you started: https://huggingface.co/blog/fine-tune-whisper