The "Anger Speech" has an obvious lisp (maybe an homage to Elmer Fudd?). But I hear a similar, more subtle, speech impediment in the "Adoration Speech". The "Fearful Speech" might have a slight warble to it. And the "Long Speech" is difficult to evaluate because the speaker has vocal fry to an extent that I find annoying.
Was it trained on Sam Altman?
I like me a good rabbit hole that's interesting and also digs into stereotypes.
Turns out, like many memes, it's not just that. It's (also?) a normal speech pattern, used by different genders, ages, and social groups, in many languages.
This doesn't mean that vocal fry isn't used as social signaling. But complaining about it, well, isn't that social signaling too?
Geoff Lindsey - Vocal Fry: what it is, who does it, and why people hate it! - https://www.youtube.com/watch?v=Q0yL2GezneU
> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.
So basically just concatenating the audio vectors without compression or discretization?
I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.
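If I'm reading the abstract right, something like this minimal numpy sketch (all shapes and names are my assumptions, not from the paper):

```python
import numpy as np

# Hypothetical sizes -- my assumptions, not from the paper.
n_text_tokens = 5   # length of the text sequence
d_model = 8         # LM embedding dimension (acoustic vectors projected to match)

rng = np.random.default_rng(0)

# One embedding per text token (what the LM already consumes).
text_embeddings = rng.normal(size=(n_text_tokens, d_model))

# The aligned approach: exactly ONE continuous acoustic vector per
# text token -- no fixed-rate stream of discrete audio tokens.
acoustic_vectors = rng.normal(size=(n_text_tokens, d_model))

# A single synchronized stream: text and speech move in lockstep,
# fused per position (summing here; concatenation would also work).
fused_stream = text_embeddings + acoustic_vectors

print(fused_stream.shape)  # same length as the text sequence
```

So it wouldn't be concatenation along the time axis: the sequence never gets longer than the text, because each acoustic vector is fused into the position of the text token it aligns with. But again, that's my guess from the quoted description.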
Could you maybe trick it into thinking it was continuing a sample for an assistant use case if the sample was generic enough?
I appreciate them being honest about it though because otherwise I might spend two days trying to make it work.
All that said, I think it likely that this has been built and trained only on Nvidia hardware. Home setups, such as Apple's, don't count for serious workflows that must run reliably.