Gemini 3 Pro Preview has superlative audio listening comprehension. If I send it a recording of myself in a car (me talking in English, another passenger and the driver talking to each other in Spanish, and the radio playing in Portuguese), Gemini can parse all four audio streams as well as the other background noises and give a translation for each one, including figuring out which voice belongs to which person and what everyone's names are (if that can be inferred from the conversation).
I'm sure it would have superlative audio generation capabilities too, if such a feature were enabled.
Sharing my setup in case it's useful to others; it's especially handy when working with CLI agents like Claude Code or Codex-CLI:
STT: Hex [1] (open-source), with Parakeet V3 - stunningly fast, near-instant transcription. The slight accuracy drop relative to bigger models is immaterial when you're talking to an AI. I always ask it to restate what it understood, and it gives back a nicely structured version -- this confirms understanding and likely helps the CLI agent stay on track. It's a native macOS app and leverages Core ML and the Neural Engine for extremely fast transcription. (I used to recommend a similar app, Handy, but it has frequent stuttering issues, and Hex is actually even faster, which I didn't think was possible!)
TTS: Kyutai's Pocket-TTS [2], just 100M params, with amazing speech quality (English only). I made a voice plugin [3] based on it for Claude Code, so it can speak short updates whenever CC stops. It uses a combination of hooks that nudge the main agent to append a speakable summary, falling back to a headless agent in case the main agent forgets (a rough sketch of the hook idea is below the command list). Turns out to be surprisingly useful. It's also fun, as you can customize the speaking style to mirror your vibe, "colorful language", etc.
The voice plugin provides commands to control it:
/voice:speak stop
/voice:speak azelma (change the voice)
/voice:speak prompt <your arbitrary prompt to control the style>
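For anyone wondering how the Stop-hook side of a setup like this can be wired up, here's a minimal sketch (not the actual plugin code). It assumes the hook receives a JSON payload on stdin containing a transcript_path that points at a JSONL transcript (check the Claude Code hooks docs for the exact schema), and it calls a hypothetical `pocket-tts-speak` CLI standing in for a Pocket-TTS wrapper:

  #!/usr/bin/env python3
  # Sketch of a Stop hook: read the hook payload from stdin, pull the last
  # assistant text block out of the JSONL transcript, and speak a short
  # snippet of it. "pocket-tts-speak" is a hypothetical CLI wrapper.
  import json, subprocess, sys

  payload = json.load(sys.stdin)
  last_text = ""
  with open(payload["transcript_path"]) as f:
      for line in f:
          try:
              entry = json.loads(line)
          except json.JSONDecodeError:
              continue
          if entry.get("type") != "assistant":
              continue
          for block in entry.get("message", {}).get("content", []):
              if isinstance(block, dict) and block.get("type") == "text":
                  last_text = block["text"]

  if last_text:
      # Keep spoken updates short; truncate long agent output.
      subprocess.run(["pocket-tts-speak", last_text[:400]], check=False)

You'd register a script like this under the Stop event in your Claude Code hook settings (again, see the hooks docs for the exact format).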
[1] Hex: https://github.com/kitlangton/Hex
[2] Pocket-TTS: https://github.com/kyutai-labs/pocket-tts
[3] Voice plugin for Claude Code: https://pchalasani.github.io/claude-code-tools/plugins-detai...
I had cause to do the opposite: hotkey -> clipboard TTS
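One rough way to do that direction on macOS, using only the built-in pbpaste and say commands (not necessarily this commenter's setup; the hotkey binding itself, e.g. via a Shortcuts quick action or a hotkey daemon, is left out):

  #!/usr/bin/env python3
  # Read whatever is on the clipboard and speak it with the built-in
  # macOS `say` command. Bind this script to a hotkey externally.
  import subprocess

  text = subprocess.run(["pbpaste"], capture_output=True, text=True).stdout.strip()
  if text:
      subprocess.run(["say", "-r", "300", text], check=False)  # -r = words per minute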
https://www.tavus.io/post/sparrow-0-advancing-conversational...
They'll optimize down the stack once they've sucked all the oxygen out of the room.
Little players won't be able to grow through the ceiling the giants create.
Thus far AI has only been used to create fan-fiction clips that generate free marketing for legacy IP on TikTok. And the rights holders know that if AI gets good enough to make feature-length movies, they'll be able to aggressively use various legal mechanisms to take the videos off major sites and pursue the creators. Long term, it could potentially lower internal production costs by getting rid of actors & writers.
Music is very different. The production cost is already zero, and people generating their own Taylor Swift songs is a real competitive threat to Spotify etc.
Not qoez:
You have to balance market opportunity against reputational damage and litigation risk.
Video will probably make a lot more money than audio, so you're willing to take a bigger risk. Additionally, at least for Google, there's strong synergy between its video generation models and YouTube, which makes it even more sensible to make video models available to the public despite these risks.
NVIDIA's basically the galaxy's most successful arms dealer, selling to both sides while convincing everyone they're just "enabling innovation." The real rebels would be training audio models on potato-patched RP2040s. Brave souls, if they exist.
https://www.daily.co/blog/benchmarking-stt-for-voice-agents/
They’ll wait for progress to be made and then buy the capability/expertise/talent when the time is right.
While on a walk with a mobile phone + earphones, dump an article/paper/HN post/GitHub repo into the mobile chat app (ChatGPT, Claude, or Gemini) and use voice mode to have it walk you through it conversationally, so you can ask follow-up questions during the walk-through and have the AI do web searches, etc. I know I could do something like this with NotebookLM, but I want to engage in the conversation, and while NotebookLM does have an interactive mode, it has been super flaky, to say the least.
I pay for ChatGPT Pro and the voice mode is really bad: it pretends to do web searches and makes up things, and when pushed says it didn't actually read the article. Also the voice sounds super-condescending.
Gemini Pro mobile app - similarly refuses to open links and sounds as if it's talking to a baby.
Claude mobile app was the best among these - the voice is very tolerable in terms of tone, but like the others it can't open links. It does do web searches, but it only gets some kind of summary of the pages and doesn't actually go into the links themselves to give me details.
Each of the LoRA tunes we did took maybe 2-3 hours on the same A10 instance.
So I am hoping for something like PersonaPlex but a bit larger.
Has anyone tested MiniCPM-o? How is it at instruction following?
But this is a fluff piece: "underfunded" means a total of around $400 million ($330 million in the initial round, $70 million for Gradium). Compare that to Elevenlabs, which built its initial product on a $2 million pre-seed.
A bunch of other stuff there is disingenuous, like comparing their 7B model to Llama-3 405B (hint: the 7B model is a _lot_ dumber). There's also an outright lie: that a team of 4 made Moshi, which is corrected _in the same piece_ to 8 if you read far enough.
With the prompt "WWII Plane Japan Kawasaki Ki-61 flying by, propeller airplane", looping turned on, and the duration set manually to 30 seconds instead of auto (the duration predictor fails pretty badly on this prompt; you need to be logged in to set the duration manually), it works pretty well. No idea if it's close to that specific airplane, but it sounds like a WW2 plane to me.
It has modern voices, but I prefer the robotic voice from 15 years ago because it's very patterned and predictable, which makes it easier to follow at super-fast speaking rates -- something closer to my visual reading rate.
..plenty of money to be made elsewhere
Audio is too niche and porn is too ethically messy and legally risky.
There's also music, which the giants don't touch either. Suno is actually really impressive.
Wisprflow does not create its own models, but I know Willow Voice did extensive fine-tuning to improve the quality and speed of their transcription models, so you may count them.
You get all kinds of weird noises and random words. Jack is often apologetic about the problem you are having with the Hyperion xt5000 smart hub.