The trick is that we have "pretty good" results for TTS as-is, but it has significant shortcomings that are more visible in certain use cases. The operative word is "prosody" - the cadence, rhythm, and pauses of natural speech, which are heavily dependent on context and content.
Prosody is incredibly important to making natural utterances - TTS models that do not model prosody end up sounding very "flat", which describes most of the heavily used TTS engines out there right now. This is less glaring for short responses like what you would get from a voice assistant, but becomes a hugely grating problem when you try to do long-form text reading.
The trick with prosody is that it often requires information and context not contained in the text to be read. You would apply different rhythm and stresses to a horror story than you would to a conference keynote speech, for example. It also requires a more sophisticated understanding of the content of the text, rather than simply its constituent words, in order to figure out proper stresses and pauses.
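To make this concrete: the main way to hand that missing context to today's engines is explicit markup, e.g. W3C SSML, where the author spells out pacing, pitch, and pauses by hand. A rough sketch of what annotating a horror-story line might look like (the specific rate/pitch values here are illustrative guesses, not recommendations):

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- Slow, low delivery to set a tense mood -->
  <prosody rate="slow" pitch="low">
    The house had been empty for years.
  </prosody>
  <!-- A deliberate dramatic pause before the reveal -->
  <break time="800ms"/>
  <prosody rate="slow">
    Or so she <emphasis level="strong">thought</emphasis>.
  </break-placeholder-removed></prosody>
</speak>
```

The point is that every one of those tags encodes a judgment a human reader makes automatically from context - which is exactly the information raw text doesn't carry, and why hand-annotation doesn't scale to long-form reading.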
All of this is eminently solvable (as demonstrated here with the book voices) but is... rather difficult. I suspect we're not terribly close to a product where you can just feed it raw text (without annotating it or otherwise providing additional data as context) and get a great result.