Gemini 3 Pro Preview has superlative audio listening comprehension. If I send it a recording of myself in a car (me talking in English, another passenger and the driver talking to each other in Spanish, and the radio playing in Portuguese), Gemini can parse all four audio streams as well as the other background noises and give a translation for each one, including figuring out which voice belongs to which person and what everyone's names are (if that can be inferred from the conversation).
I'm sure it would have superlative audio generation capabilities too, if such a feature were enabled.
Sharing my setup in case it's useful to others; it's especially handy when working with CLI agents like Claude Code or Codex-CLI:
STT: Hex [1] (open-source), with Parakeet V3 - stunningly fast, near-instant transcription. The slight accuracy drop relative to bigger models is immaterial when you're talking to an AI. I always ask it to restate what it understood, and it gives back a nicely structured version -- this confirms understanding and likely helps the CLI agent stay on track. It's a native macOS app and leverages Core ML and the Neural Engine for extremely fast transcription. (I used to recommend a similar app, Handy, but it has frequent stuttering issues, and Hex is actually even faster, which I didn't think was possible!)
TTS: Kyutai's Pocket-TTS [2], just 100M params, with amazing speech quality (English only). I made a voice plugin [3] based on it for Claude Code, so it can speak short updates whenever CC stops. It uses a combination of hooks that nudge the main agent to append a speakable summary, falling back to a headless agent in case the main agent forgets (a rough sketch of the hook idea is below the command list). Turns out to be surprisingly useful. It's also fun, as you can customize the speaking style to mirror your vibe, "colorful language", etc.
The voice plugin provides commands to control it:
/voice:speak stop
/voice:speak azelma (change the voice)
/voice:speak prompt <your arbitrary prompt to control the style>
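For anyone wondering how the Stop-hook side of a setup like this can be wired up, here's a minimal sketch (not the actual plugin code). It assumes the hook receives a JSON payload on stdin containing a transcript_path that points at a JSONL transcript (check the Claude Code hooks docs for the exact schema), and it calls a hypothetical `pocket-tts-speak` CLI standing in for a Pocket-TTS wrapper:

  #!/usr/bin/env python3
  # Sketch of a Stop hook: read the hook payload from stdin, pull the last
  # assistant text block out of the JSONL transcript, and speak a short
  # snippet of it. "pocket-tts-speak" is a hypothetical CLI wrapper.
  import json, subprocess, sys

  payload = json.load(sys.stdin)
  last_text = ""
  with open(payload["transcript_path"]) as f:
      for line in f:
          try:
              entry = json.loads(line)
          except json.JSONDecodeError:
              continue
          if entry.get("type") != "assistant":
              continue
          for block in entry.get("message", {}).get("content", []):
              if isinstance(block, dict) and block.get("type") == "text":
                  last_text = block["text"]

  if last_text:
      # Keep spoken updates short; truncate long agent output.
      subprocess.run(["pocket-tts-speak", last_text[:400]], check=False)

You'd register a script like this under the Stop event in your Claude Code hook settings (again, see the hooks docs for the exact format).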
[1] Hex: https://github.com/kitlangton/Hex
[2] Pocket-TTS: https://github.com/kyutai-labs/pocket-tts
[3] Voice plugin for Claude Code: https://pchalasani.github.io/claude-code-tools/plugins-detai...
I had cause to do the opposite: hotkey -> clipboard TTS
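One rough way to do that direction on macOS, using only the built-in pbpaste and say commands (not necessarily this commenter's setup; the hotkey binding itself, e.g. via a Shortcuts quick action or a hotkey daemon, is left out):

  #!/usr/bin/env python3
  # Read whatever is on the clipboard and speak it with the built-in
  # macOS `say` command. Bind this script to a hotkey externally.
  import subprocess

  text = subprocess.run(["pbpaste"], capture_output=True, text=True).stdout.strip()
  if text:
      subprocess.run(["say", "-r", "300", text], check=False)  # -r = words per minute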
https://www.tavus.io/post/sparrow-0-advancing-conversational...
They'll optimize down the stack once they've sucked all the oxygen out of the room.
Little players won't be able to grow through the ceiling the giants create.
Thus far AI has only been used to create fan-fiction clips that generate free marketing for legacy IP on TikTok. And the rights holders know that if AI gets good enough to make feature-length movies, they'll be able to aggressively use various legal mechanisms to take the videos off major sites and pursue the creators. Long term, it could potentially lower internal production costs by getting rid of actors & writers.
Music is very different. The production cost is already zero, and people generating their own Taylor Swift songs is a real competitive threat to Spotify etc.
Not qoez:
You have to balance market opportunity against reputational damage and litigation risk.
Video will probably make a lot more money than audio, so you're willing to take a bigger risk. Additionally, at least for Google, there's strong synergy between its video generation models and YouTube, which makes it even more sensible to make video models available to the public despite these risks.
NVIDIA's basically the galaxy's most successful arms dealer, selling to both sides while convincing everyone they're just "enabling innovation." The real rebels would be training audio models on potato-patched RP2040s. Brave souls, if they exist.
https://www.daily.co/blog/benchmarking-stt-for-voice-agents/
They’ll wait for progress to be made and then buy the capability/expertise/talent when the time is right.
While on a walk with a mobile phone + earphones, dump an article/paper/HN post/GitHub repo into the mobile chat app (ChatGPT, Claude, or Gemini) and use voice mode to have it walk you through it conversationally, so you can ask follow-up questions during the walk-through and have the AI do web searches, etc. I know I could do something like this with NotebookLM, but I want to engage in the conversation, and while NotebookLM does have an interactive mode, it has been super flaky, to say the least.
I pay for ChatGPT Pro and the voice mode is really bad: it pretends to do web searches and makes up things, and when pushed says it didn't actually read the article. Also the voice sounds super-condescending.
Gemini Pro mobile app - similarly refuses to open links and sounds as if it's talking to a baby.
Claude mobile app was the best among these - the voice is very tolerable in terms of tone, but like the others it can't open links. It does do web searches, but it only gets some kind of summary of the pages and doesn't actually go into the links themselves to give me details.
Each of the LoRA tunes we did took maybe 2-3 hours on the same A10 instance.
So I am hoping for something like PersonaPlex but a bit larger.
Has anyone tested MiniCPM-o? How is it at instruction following?
But this is a fluff piece: "underfunded" means a total of around $400 million ($330 million in the initial round, $70 million for Gradium). Compare that to Elevenlabs, which built its initial product on a $2 million pre-seed.
A bunch of other stuff there is disingenuous, like comparing their 7B model to Llama-3 405B (hint: the 7B model is a _lot_ dumber). There's also an outright lie: that a team of 4 made Moshi, which is corrected _in the same piece_ to 8 if you read far enough.
With the prompt "WWII Plane Japan Kawasaki Ki-61 flying by, propeller airplane", looping turned on, and the duration set manually to 30 seconds instead of auto (the duration predictor fails pretty badly on this prompt; you need to be logged in to set the duration manually), it works pretty well. No idea if it's close to that specific airplane, but it sounds like a WW2 plane to me.
It has modern voices, but I prefer the robotic voice from 15 years ago because it's very patterned and predictable, which makes it easier to follow at super-fast speaking rates -- something closer to my visual reading rate.
..plenty of money to be made elsewhere
Audio is too niche and porn is too ethically messy and legally risky.
There's also music, which the giants don't touch either. Suno is actually really impressive.
Wisprflow does not create its own models, but I know Willow Voice did extensive fine-tuning to improve the quality and speed of their transcription models, so you may count them.
You get all kinds of weird noises and random words. Jack is often apologetic about the problem you are having with the Hyperion xt5000 smart hub.