The "Anger Speech" has an obvious lisp (maybe an homage to Elmer Fudd?). But I hear a similar, more subtle, speech impediment in the "Adoration Speech". The "Fearful Speech" might have a slight warble to it. And the "Long Speech" is difficult to evaluate because the speaker has vocal fry to an extent that I find annoying.
Was it trained on Sam Altman?
I like me a good rabbit hole that's interesting and also digs into stereotypes.
Turns out, like many memes, it's not just that. It's (also?) a normal speech pattern, used by different genders, ages, and social groups, in many languages.
This doesn't mean that vocal fry isn't used as social signaling. But complaining about it, well, isn't that social signaling too?
Geoff Lindsey - Vocal Fry: what it is, who does it, and why people hate it! - https://www.youtube.com/watch?v=Q0yL2GezneU
> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.
So basically just concatenating the audio vectors without compression or discretization?
I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.
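If I'm reading the abstract right, something like this minimal numpy sketch (all shapes and names are my assumptions, not from the paper):

```python
import numpy as np

# Hypothetical sizes -- my assumptions, not from the paper.
n_text_tokens = 5   # length of the text sequence
d_model = 8         # LM embedding dimension (acoustic vectors projected to match)

rng = np.random.default_rng(0)

# One embedding per text token (what the LM already consumes).
text_embeddings = rng.normal(size=(n_text_tokens, d_model))

# The aligned approach: exactly ONE continuous acoustic vector per
# text token -- no fixed-rate stream of discrete audio tokens.
acoustic_vectors = rng.normal(size=(n_text_tokens, d_model))

# A single synchronized stream: text and speech move in lockstep,
# fused per position (summing here; concatenation would also work).
fused_stream = text_embeddings + acoustic_vectors

print(fused_stream.shape)  # same length as the text sequence
```

So it wouldn't be concatenation along the time axis: the sequence never gets longer than the text, because each acoustic vector is fused into the position of the text token it aligns with. But again, that's my guess from the quoted description.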
Could you maybe trick it into thinking it was continuing a sample for an assistant use case if the sample was generic enough?
I appreciate them being honest about it though because otherwise I might spend two days trying to make it work.
All that said, I think it likely that this has been built and trained only on Nvidia hardware. Home setups, such as Apple's, don't count for serious workflows that must run reliably.