[1]: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
I'm actually a little surprised they haven't added model size to that chart.
Oh, and I typed this in Handy with just my voice and Parakeet version three, which is absolutely crazy.
And Handy even takes care of all the punctuation, which is really nice.
Thanks a lot for suggesting it to me. I actually wanted something like this. Before, I was using Google Docs, which required Chrome for the speech-to-text feature; I ended up using Orion for that, since Orion can somehow pass as Chrome while still supporting both Firefox and Chrome extensions. So I had that set up, but yeah.
This is really amazing and honestly a sort of lifesaver, so thanks a lot, man.
Now I can just speak and have it converted to text without going through a non-local model, Google Docs, or anything else.
Why is this so good, man? It's so good, man. I used to think I had fully maxed out my typing speed at around 100-120 WPM, but this can actually write faster than that. It's pretty amazing.
Have a nice day, or as I abbreviate it, HAND, smiley face. :D
That's unfortunate. I think I can update my version, but my elder brother has told me the newer update has some performance problems.
I can tell that this is now definitely going to be my go-to model and app on all my clients.
The one built in is much faster, and you only have to toggle it on.
Are these really that much more accurate? I definitely have to correct stuff, but it's been a pretty good experience.
I also use speech-to-text on my iPhone, which seems to have about the same accuracy.
One note for anyone using Handy with codex-cli on macOS: the default "Option + Space" shortcut inserts spaces mid-speech. "Left Ctrl + Fn" works cleanly instead. I'm curious to know which shortcuts you're using.
edit: holy shit, Parakeet is good.... Moonshine is impressive too, and at half the params.
Now if only there were something just as quick as Parakeet v3 for TTS! Then I could talk to codex all day long!!
Very lightweight and good quality
I think most apps that use Parakeet tend to use this version of the model?
See if Parakeet (Nemotron) still uses 4GB+ with my implementation: https://rift-transcription.vercel.app/local-setup
I built a macOS dictation app (https://github.com/T0mSIlver/localvoxtral) on top of Voxtral Realtime, and the UX difference between streaming and offline STT is night and day. Words appearing while you're still talking completely changes the feedback loop. You catch errors in real time, you can adjust what you're saying mid-sentence, and the whole thing feels more natural. Going back to "record then wait" feels broken after that.
Curious how Moonshine's streaming latency compares in practice. Do you have numbers on time-to-first-token for the streaming mode? And on the serving side, do any of the integration options expose an OpenAI Realtime-compatible WebSocket endpoint?
I made moonshine the default because it has the best accuracy/latency (aside from Web Speech API, but that is not fully local)
I plan to add objective benchmarks in the future, so multiple models can be compared against the same audio data...
---
I made a custom WebSocket server for my project. It defines its own API (modeled on the Sherpa-onnx API), but you could adjust it to output the OpenAI Realtime API: https://github.com/Leftium/rift-local
(note: rift-local is optimized for single connections; or rather, it is not optimized to handle multiple WS connections)
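For what it's worth, the adapter idea is mostly a message-translation layer. A minimal sketch, assuming rift-local's handler emits (text, is_final) partial results, and using Realtime-style event names that should be double-checked against the actual OpenAI spec:

```python
import json

# Hypothetical sketch: wrapping a Sherpa-onnx-style partial transcript into an
# OpenAI Realtime-style JSON event. The event type strings are my assumption
# from the Realtime API docs and may need adjusting.
def to_realtime_event(text: str, is_final: bool, item_id: str = "item_0") -> str:
    """Translate one (text, is_final) partial result into a Realtime-style event."""
    if is_final:
        payload = {
            "type": "conversation.item.input_audio_transcription.completed",
            "item_id": item_id,
            "transcript": text,
        }
    else:
        payload = {
            "type": "conversation.item.input_audio_transcription.delta",
            "item_id": item_id,
            "delta": text,
        }
    return json.dumps(payload)
```

The WS server would call this on each partial result before sending, instead of emitting its own message format.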
I'd love a faster and more accurate option than Whisper, but streamers need something off-the-shelf they can install in their pipeline, like an OBS plugin which can just grab the audio from their OBS audio sources.
I see a couple of obvious problems: this doesn't seem to support translation, which is unfortunate; that's pretty key for this use case. It also only supports one language at a time, which is problematic given how frequently streamers code-switch while talking to their chat in different languages, or on Discord with their gameplay partners. Maybe such a plugin could detect which language is being spoken and route to the appropriate model as needed?
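That routing idea could be sketched roughly like this; `lang_id_scores` and the model names here are hypothetical stand-ins for a real language-ID pass and real single-language STT backends:

```python
# Hypothetical sketch of per-chunk language routing: run a lightweight
# language-ID pass on each audio chunk, then send the chunk to the matching
# single-language model.
def route_chunk(lang_id_scores: dict[str, float],
                models: dict[str, str],
                fallback: str = "en") -> str:
    """Return the model name to use for this chunk, given language-ID scores."""
    best_lang = max(lang_id_scores, key=lang_id_scores.get)
    return models.get(best_lang, models[fallback])

models = {"en": "moonshine-en", "de": "moonshine-de"}  # placeholder model names
# A chunk scored mostly German gets routed to the German model:
route_chunk({"en": 0.2, "de": 0.8}, models)  # -> "moonshine-de"
```

The hard part in practice is that language ID on short chunks is noisy, so some hysteresis (only switching after a few consistent chunks) would probably be needed.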
The authors do acknowledge this, though, and give a slightly over-complex way to do it with uv in an example project. (FYI, you don't need to source anything if you use uv run.)
The minimum useful data for this stuff is a small table: one row per language, with WER per dataset.
There was a demo linked in an issue, but it's gone now. I can't recall for sure, but I think I got it working locally too, then it broke unexpectedly and I never figured out why.
I also did a survey of other in-browser transcription solutions: https://github.com/Leftium/rift-transcription/blob/main/refe...
- Notably, there is an (unrelated?) moonshine demo based on transformers.js (using WebGPU) with WASM fallback.
Weird to only release English as open weights.
hear about what people might build with it
My startup makes software for firefighters to use on tablets during missions, and I'm excited to see (when I get the time) whether we can use this as a keyboard alternative on the device. It's a use case where avoiding "clunky" is important, and a perfect fit for speech-to-text.

With the sector increasingly worried about "hybrid threats", we try to rely on the cloud as little as possible and run things either on-device or with the option of self-hosting/on-premise. I really like the direction your company is going in this respect.
We'd probably need custom training: we need Norwegian, and there's some lingo, e.g., "bravo one two" should become "B-1.2". While that could perhaps also be done with simple post-processing rules, we'd probably want such examples in training for improved recognition too. We have no VC funding, but I'm looking forward to getting some income so that we can send some of it your way :)
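For what it's worth, the post-processing half of that is fairly simple. A minimal sketch for the callsign example, where the NATO-alphabet and number-word lists are my assumptions rather than your actual lingo:

```python
import re

# Post-processing sketch: rewrite spoken callsigns like "bravo one two" as
# "B-1.2". The word lists below are illustrative assumptions, not a real
# domain vocabulary.
NATO = {"alpha": "A", "bravo": "B", "charlie": "C", "delta": "D"}
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def normalize_callsigns(text: str) -> str:
    """Replace '<nato word> <digit word> <digit word>' with '<Letter>-<d>.<d>'."""
    words = "|".join(NATO)
    digits = "|".join(DIGITS)
    pattern = re.compile(rf"\b({words}) ({digits}) ({digits})\b", re.IGNORECASE)

    def repl(m: re.Match) -> str:
        letter = NATO[m.group(1).lower()]
        return f"{letter}-{DIGITS[m.group(2).lower()]}.{DIGITS[m.group(3).lower()]}"

    return pattern.sub(repl, text)

normalize_callsigns("send bravo one two to the scene")  # -> "send B-1.2 to the scene"
```

Rules like this cover the display side, but as you say, they won't help the recognizer itself hear "bravo" correctly, which is where training examples come in.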
Edit: It was https://muninai.eu (I shut down the backend server yesterday so the functionality is disabled).
It's incredible for a live transcription stream - the latency is WOW.
For the open source folks, that's also set up in Handy, I think.
uv tool install rift-local && rift-local serve --open
This opens RIFT[1], my web frontend for local transcription with a copy button. You can also compare against Web Speech API and other models (including cloud APIs).

> This code, apart from the source in core/third-party, is licensed under the MIT License, see LICENSE in this repository.
> The English-language models are also released under the MIT License. Models for other languages are released under the Moonshine Community License, which is a non-commercial license.
> The code in core/third-party is licensed according to the terms of the open source projects it originates from, with details in a LICENSE file in each subfolder.