Some implementation details, since getting this to work well was not trivial.
My goal was “press hotkey, start talking, see text within ~1–2 seconds” on an M2 MacBook Pro, and support multiple languages.
First attempts (cloud)
– I tried Hugging Face real-time transcription. It worked but latency was all over the place and costs would not scale.
– I tried OpenAI real-time transcription. Latency was much better (I saw ~200 ms responses), but with background noise it would transcribe things that were never said. I may bring it back if I can make it stable.
– I briefly experimented with Gemini for transcribing and formatting multi-language text. Quality was not consistent enough compared to Whisper for multi-language input.
Local experiments
– I used FFmpeg + Whisper CLI in a bunch of ways: batching, buffering, trying to “stream” partial results out of Whisper to make it feel live.
– I also tried a local Llama model to format the raw transcript into an email. On an M2 Pro this took ~2 seconds for short emails and got much slower for long text. It looked nice but the latency was not acceptable for everyday use.
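To make the batching/buffering approach concrete, here is a minimal sketch of the kind of loop I mean: capture a short chunk with FFmpeg, then hand the file to a Whisper CLI. The device index (`:0` via avfoundation), the model path, and the `whisper-cli` flags are assumptions based on whisper.cpp on macOS, not my exact invocation; check them against your own setup.

```python
# Sketch of a chunked capture -> transcribe pipeline. The ffmpeg device
# index and the whisper.cpp flags are assumptions; adjust for your machine
# and whichever Whisper CLI build you use.
import subprocess

def capture_cmd(wav_path: str, seconds: float) -> list[str]:
    # Record `seconds` of mono 16 kHz audio from the default mac mic.
    return [
        "ffmpeg", "-y",
        "-f", "avfoundation", "-i", ":0",   # assumption: audio device 0 is the mic
        "-t", str(seconds),
        "-ar", "16000", "-ac", "1",         # Whisper expects 16 kHz mono
        wav_path,
    ]

def transcribe_cmd(wav_path: str, model_path: str) -> list[str]:
    # whisper.cpp's CLI; "-nt" drops timestamps so stdout is plain text.
    return ["whisper-cli", "-m", model_path, "-f", wav_path, "-nt", "-l", "auto"]

def transcribe_chunk(wav_path: str, model_path: str, seconds: float = 2.0) -> str:
    subprocess.run(capture_cmd(wav_path, seconds), check=True, capture_output=True)
    out = subprocess.run(transcribe_cmd(wav_path, model_path),
                        check=True, capture_output=True, text=True)
    return out.stdout.strip()
```

With a loop over short chunks, perceived latency is roughly chunk length plus inference time, which is why shrinking the chunk was the main lever.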
Where I ended up (for now)
– Current version sticks to FFmpeg + Whisper CLI locally, optimized for short chunks so you usually see text within about 1–2 seconds.
– I dropped the heavy on-device LLM formatting and kept the formatting logic much simpler, so it stays predictable and fast.
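For a sense of what "much simpler" formatting means in practice: rule-based cleanup of the raw transcript instead of an LLM pass. This is a hypothetical sketch of that kind of logic, not the app's actual code — collapse whitespace, capitalize sentence starts, and ensure terminal punctuation.

```python
# Hypothetical sketch of deterministic transcript cleanup (not the app's
# exact code): cheap, predictable formatting in place of an on-device LLM.
import re

def tidy(transcript: str) -> str:
    text = re.sub(r"\s+", " ", transcript).strip()   # collapse whitespace
    if not text:
        return text
    # Capitalize the first letter of the text and of each new sentence.
    text = re.sub(r"(^|[.!?]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)
    if text[-1] not in ".!?":                        # ensure terminal punctuation
        text += "."
    return text

print(tidy("hello there.  this is a   test"))  # → "Hello there. This is a test."
```

Rules like these run in microseconds and never surprise you, which is the trade I wanted after the ~2 s Llama pass.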
Next step is to re-introduce “smart” formatting and meeting notes, but only when I can do it without blowing up latency. Happy to dig deeper into any of these if people are curious.