[1]: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
I'm actually a little surprised they haven't added model size to that chart.
Oh, and I typed this in Handy with just my voice and Parakeet version three, which is absolutely crazy.
And Handy even takes care of all the punctuation, which is really nice.
Thanks a lot for suggesting it to me. I actually wanted something like this. Before, I was using Google Docs, which required Chrome for the speech-to-text feature; I ended up using Orion for that, since Orion can somehow pass as Chrome while still supporting both Firefox and Chrome extensions. So I had that set up, but yeah.
This is really amazing and honestly a sort of lifesaver, so thanks a lot, man.
Now I can just speak and have it converted to text without going through a non-local model, Google Docs, or anything else.
Why is this so good, man? It's so good, man. I used to think I had fully maxed out my typing speed at around 100-120 WPM, but this can actually write faster than that. It's pretty amazing.
Have a nice day, or as I abbreviate it, HAND, smiley face. :D
That's unfortunate. I think I can update my version, but my elder brother has told me the newer update has some performance problems.
I can tell that this is now definitely going to be my go-to model and app on all my clients.
The one built in is much faster, and you only have to toggle it on.
Are these really that much more accurate? I definitely have to correct stuff, but it's been a pretty good experience.
I also use speech-to-text on my iPhone, which seems to have about the same accuracy.
One note for anyone using Handy with codex-cli on macOS: the default "Option + Space" shortcut inserts spaces mid-speech. "Left Ctrl + Fn" works cleanly instead. I'm curious to know which shortcuts you're using.
edit: holy shit, Parakeet is good.... Moonshine is impressive too, and at half the params.
Now if only there were something just as quick as Parakeet v3 for TTS! Then I could talk to codex all day long!!
Very lightweight and good quality
I think most apps that use Parakeet tend to use this version of the model?
See if Parakeet (Nemotron) still uses 4GB+ with my implementation: https://rift-transcription.vercel.app/local-setup
I built a macOS dictation app (https://github.com/T0mSIlver/localvoxtral) on top of Voxtral Realtime, and the UX difference between streaming and offline STT is night and day. Words appearing while you're still talking completely changes the feedback loop. You catch errors in real time, you can adjust what you're saying mid-sentence, and the whole thing feels more natural. Going back to "record then wait" feels broken after that.
Curious how Moonshine's streaming latency compares in practice. Do you have numbers on time-to-first-token for the streaming mode? And on the serving side, do any of the integration options expose an OpenAI Realtime-compatible WebSocket endpoint?
I made moonshine the default because it has the best accuracy/latency (aside from Web Speech API, but that is not fully local)
I plan to add objective benchmarks in the future, so multiple models can be compared against the same audio data...
---
I made a custom WebSocket server for my project. It defines its own API (modeled on the Sherpa-onnx API), but you could adjust it to output the OpenAI Realtime API: https://github.com/Leftium/rift-local
(note: rift-local is optimized for single connections; or rather, it is not optimized to handle multiple WS connections)
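For what it's worth, the adapter idea is mostly a message-translation layer. A minimal sketch, assuming rift-local's handler emits (text, is_final) partial results, and using Realtime-style event names that should be double-checked against the actual OpenAI spec:

```python
import json

# Hypothetical sketch: wrapping a Sherpa-onnx-style partial transcript into an
# OpenAI Realtime-style JSON event. The event type strings are my assumption
# from the Realtime API docs and may need adjusting.
def to_realtime_event(text: str, is_final: bool, item_id: str = "item_0") -> str:
    """Translate one (text, is_final) partial result into a Realtime-style event."""
    if is_final:
        payload = {
            "type": "conversation.item.input_audio_transcription.completed",
            "item_id": item_id,
            "transcript": text,
        }
    else:
        payload = {
            "type": "conversation.item.input_audio_transcription.delta",
            "item_id": item_id,
            "delta": text,
        }
    return json.dumps(payload)
```

The WS server would call this on each partial result before sending, instead of emitting its own message format.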
I'd love a faster and more accurate option than Whisper, but streamers need something off-the-shelf they can install in their pipeline, like an OBS plugin which can just grab the audio from their OBS audio sources.
I see a couple of obvious problems: this doesn't seem to support translation, which is unfortunate; that's pretty key for this use case. It also only supports one language at a time, which is problematic given how frequently streamers code-switch while talking to their chat in different languages, or on Discord with their gameplay partners. Maybe such a plugin could detect which language is being spoken and route to the appropriate model as needed?
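That routing idea could be sketched roughly like this; `lang_id_scores` and the model names here are hypothetical stand-ins for a real language-ID pass and real single-language STT backends:

```python
# Hypothetical sketch of per-chunk language routing: run a lightweight
# language-ID pass on each audio chunk, then send the chunk to the matching
# single-language model.
def route_chunk(lang_id_scores: dict[str, float],
                models: dict[str, str],
                fallback: str = "en") -> str:
    """Return the model name to use for this chunk, given language-ID scores."""
    best_lang = max(lang_id_scores, key=lang_id_scores.get)
    return models.get(best_lang, models[fallback])

models = {"en": "moonshine-en", "de": "moonshine-de"}  # placeholder model names
# A chunk scored mostly German gets routed to the German model:
route_chunk({"en": 0.2, "de": 0.8}, models)  # -> "moonshine-de"
```

The hard part in practice is that language ID on short chunks is noisy, so some hysteresis (only switching after a few consistent chunks) would probably be needed.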
The authors do acknowledge this, though, and give a slightly over-complex way to do it with uv in an example project. (FYI, you don't need to source anything if you use uv run.)
The minimum useful data for this stuff is a small table: one row per language, with WER per dataset.
There was a demo linked in an issue, but it's gone now. I can't recall for sure, but I think I got it working locally too, then it broke unexpectedly and I never figured out why.
I also did a survey of other in-browser transcription solutions: https://github.com/Leftium/rift-transcription/blob/main/refe...
- Notably, there is an (unrelated?) moonshine demo based on transformers.js (using WebGPU) with WASM fallback.
Weird to only release English as open weights.
hear about what people might build with it
My startup makes software for firefighters to use on tablets during missions, and I'm excited to see (when I get the time) whether we can use this as a keyboard alternative on the device. It's a use case where avoiding "clunky" is important, and a perfect fit for speech-to-text.

With the sector increasingly worried about "hybrid threats", we try to rely on the cloud as little as possible and run things either on-device or with the option of self-hosting/on-premise. I really like the direction your company is going in this respect.
We'd probably need custom training: we need Norwegian, and there's some lingo, e.g., "bravo one two" should become "B-1.2". While that could perhaps also be done with simple post-processing rules, we'd probably want such examples in training for improved recognition too. We have no VC funding, but I'm looking forward to getting some income so that we can send some of it your way :)
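For what it's worth, the post-processing half of that is fairly simple. A minimal sketch for the callsign example, where the NATO-alphabet and number-word lists are my assumptions rather than your actual lingo:

```python
import re

# Post-processing sketch: rewrite spoken callsigns like "bravo one two" as
# "B-1.2". The word lists below are illustrative assumptions, not a real
# domain vocabulary.
NATO = {"alpha": "A", "bravo": "B", "charlie": "C", "delta": "D"}
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def normalize_callsigns(text: str) -> str:
    """Replace '<nato word> <digit word> <digit word>' with '<Letter>-<d>.<d>'."""
    words = "|".join(NATO)
    digits = "|".join(DIGITS)
    pattern = re.compile(rf"\b({words}) ({digits}) ({digits})\b", re.IGNORECASE)

    def repl(m: re.Match) -> str:
        letter = NATO[m.group(1).lower()]
        return f"{letter}-{DIGITS[m.group(2).lower()]}.{DIGITS[m.group(3).lower()]}"

    return pattern.sub(repl, text)

normalize_callsigns("send bravo one two to the scene")  # -> "send B-1.2 to the scene"
```

Rules like this cover the display side, but as you say, they won't help the recognizer itself hear "bravo" correctly, which is where training examples come in.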
Edit: It was https://muninai.eu (I shut down the backend server yesterday so the functionality is disabled).
It's incredible for a live transcription stream - the latency is WOW.
For the open source folks, that's also set up in Handy, I think.
uv tool install rift-local && rift-local serve --open
This opens RIFT[1], my web frontend for local transcription with a copy button. You can also compare against Web Speech API and other models (including cloud APIs).

> This code, apart from the source in core/third-party, is licensed under the MIT License, see LICENSE in this repository.
> The English-language models are also released under the MIT License. Models for other languages are released under the Moonshine Community License, which is a non-commercial license.
> The code in core/third-party is licensed according to the terms of the open source projects it originates from, with details in a LICENSE file in each subfolder.