The key here is that the Whisper multilingual ASR model has been trained on a huge amount of data, so its encoder output is a very good representation of the semantic content of speech. This can be used as an open-source, drop-in replacement for the semantic encoder in model architectures like SPEAR-TTS/VALL-E/etc. (whose semantic encoders are not publicly available). The semantic tokens are then used to predict acoustic tokens (the output of the quantized/low-bandwidth Encodec audio codec), which are then upsampled/denoised/enhanced with the Vocos vocoder.
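A conceptual sketch of the data flow described above. All function names and token shapes here are placeholders, not the real WhisperSpeech/Encodec/Vocos APIs; the point is only what flows between the three stages.

```python
# Conceptual sketch of the pipeline: text -> semantic tokens ->
# acoustic tokens -> waveform. Every stage is a stand-in stub.

def whisper_semantic_tokens(text):
    """Stand-in for Whisper-encoder-derived semantic tokens (fake ids)."""
    return [hash(w) % 1024 for w in text.split()]

def predict_acoustic_tokens(semantic):
    """Stand-in for the semantic->acoustic model (cf. SPEAR-TTS/VALL-E).
    Acoustic tokens run at a higher rate, so emit several per input token."""
    return [(t * 3) % 1024 for t in semantic for _ in range(4)]

def vocode(acoustic):
    """Stand-in for Encodec decoding plus Vocos enhancement -> samples."""
    return [float(t) / 1024 for t in acoustic]

audio = vocode(predict_acoustic_tokens(whisper_semantic_tokens("hello world")))
print(len(audio))  # 8 "samples" for 2 words (4 acoustic tokens per word)
```

The real models obviously replace each stub with a neural network, but the interfaces between the stages look like this: token ids in, token ids out, waveform at the end.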
I know someone is working on Hindi, but it would be great to see this extended to other languages for a properly open-source [1], multilingual TTS platform. I think the main bottleneck at the moment is finding people who can procure/clean compliant datasets.
[0] https://github.com/jpc

[1] jpc/Collabora went to great efforts to ensure that they are only using properly licensed data to train this. I doubt Whisper itself was that compliant, so it's a bit muddy.
And the part of that which we use for WhisperSpeech is just the phonetic representation, so our model is not able to recreate any of the Whisper training data in any way.
Is that less certain than the quote implies?
Thanks for all the nice comments! I've been working really hard on this model for quite a few months now, but there are still a lot of ways we can make it better.
Thanks to the generosity of Collabora, this is a real open-source project (not just a one-time marketing ploy), so if you want to help improve it or integrate it into something you're building, I'd love to help.
You can also buy our undivided engineering attention if you have a business use-case. :)
We're probably interested! We're Overte, an open-source VR/desktop social platform.
The system targets VR and voice chat primarily, but we want to be more accessible to people who can't use voice chat for any reason. We do have an integrated chat, but it's not an ideal experience in VR. So good TTS to make it integrate better would be great for us. And the possibility of doing this without some sort of commercial API that requires keeping a secret API key is huge.
So yeah, we're very much interested in giving this one a try. It will probably take some time as we're gearing up for FOSDEM now, though.
I'm developing a dynamic tutorial desktop app and I plan to use this model as a text-to-speech synthesizer. Any chance it could be ported to ONNX format?
[0] https://github.com/netease-youdao/EmotiVoice
[1] https://github.com/siraben/emotivoice-cli
[2] https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Clon...
What we seem to need are high-quality speech recordings in any language (audiobooks are great), plus some recordings for each target language which can be low-quality but need varied prosody/emotions (otherwise everything we generate will sound like an audiobook).
Not sure whether that's enough data for you. (If you need paired text for the LibriVox audiobooks, I can provide you with versions where I "fixed" the original text to match the audiobook content e.g. when someone skipped a line.)
Last year Piper development was supported by Nabu Casa for their "Year of Voice" project for Home Assistant and it sounds like Mike Hansen is going to continue on it with their support this year.
That said, if you have a modern Nvidia GPU you should be able to run a voice-bot in real time with WhisperSpeech.
That approach would be useful for things like shifting a voice to a different accent and to support voices speaking multiple languages.
This can be done to a limited extent for models such as MBROLA voices by mapping the phonemes of one language to the phonemes of the MBROLA voice. MBROLA is more complex in that it supports diphones, and many diphone pairs don't exist, so you need to map 3 phonemes together to get the best-matching phonetic transcription.
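A toy illustration of that mapping trick (the inventory and bridge table below are made up, not a real MBROLA database): when a diphone pair is missing from the voice's inventory, route through an intermediate phoneme so both resulting pairs exist, i.e. the "map 3 phonemes together" case.

```python
# Hypothetical diphone inventory for a voice; real MBROLA voices ship
# their own diphone databases.
INVENTORY = {("h", "e"), ("e", "l"), ("l", "o"), ("o", "_"), ("e", "o")}

# When a diphone is missing, bridge through an intermediate phoneme x so
# that (a, x) and (x, b) both exist in the inventory.
BRIDGES = {("h", "o"): "e"}  # hypothetical mapping

def to_diphones(phonemes):
    out = []
    for a, b in zip(phonemes, phonemes[1:]):
        if (a, b) in INVENTORY:
            out.append((a, b))
        elif (a, b) in BRIDGES:
            x = BRIDGES[(a, b)]
            out.extend([(a, x), (x, b)])
        else:
            raise KeyError(f"no mapping for diphone {(a, b)}")
    return out

print(to_diphones(["h", "o"]))  # bridged: [('h', 'e'), ('e', 'o')]
```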
The IPA approach may also make the phonetic synthesis easier to train, given that the IPA vowels lie on a formant continuum (similar to colour wheels and cubes). The model could then better learn the variations in voice quality and timbre.
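To make the continuum idea concrete: vowels can be placed in a continuous F1/F2 formant space, so "which vowel is this?" becomes a nearest-neighbour query. The formant values below are rough textbook approximations, not measurements.

```python
import math

# Rough, textbook-approximate F1/F2 formant frequencies in Hz.
# The exact numbers vary by speaker; the point is the continuous space.
VOWELS = {
    "i": (280, 2250),  # close front
    "e": (400, 2000),  # mid front
    "u": (310, 870),   # close back
    "a": (700, 1200),  # open
}

def nearest_vowel(f1, f2):
    """Return the vowel symbol closest to a measured (F1, F2) point."""
    return min(VOWELS, key=lambda v: math.dist((f1, f2), VOWELS[v]))

print(nearest_vowel(650, 1250))  # a measurement near /a/ -> 'a'
```

Because the space is continuous, a model trained on it can interpolate between vowel qualities rather than treating each symbol as an unrelated category.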
There is still a lot to explore in this space – we certainly don't have all the answers yet!
By using the Whisper-derived phonetic representation (so-called semantic tokens), we successfully trained a model with just a high-quality speech dataset in one language, and the voice quality transferred to English.
Am I weird in just having my head spin? I've been at the leading edge of tech before, but this feels like me yelling at these new algos from my lawn.
Whisper and self-hostable LLMs had a Cambrian explosion about a year ago. I attended a GPT-4 hackathon last March and in 48 hours saw people hook up Speech2Text -> LLM -> Text2Speech pipelines for their live demos. I thought we would all have a babelfish by June.
Months later I attended some conferences with international speakers who really wanted live, translated-on-the-fly captions, but there wasn't anything off the shelf they could use. I found a helpful repo for using Whisper with rolling transcription, but struggled to get the Python prerequisites installed (it involved hardlinking to a TensorFlow repo for my particular version of M1 CPU). It was humbling and also hype-busting to realize that it takes time to productize, and that LLMs are not magic that can write these applications themselves.
In the meantime, even Google hasn't bothered to run the improved transcription models on YouTube videos. They're still the old, roughly 80%-accurate tech that's useless for anyone with an accent.
I agree. I was thinking about making a Jarvis-like bot, which should be pretty easy at this point. The main problem was that my iPhone doesn't easily allow pressing a button upon which it starts listening; you always need to unlock first, at which point the whole screen gets unlocked too. Maybe these kinds of GUI-focused interfaces are blocking a lot of ideas? At the same time, it's great that people will come up with new devices, and these will compete somewhat with phones.
If you're grappling with the slow march from cool tech demos to real-world language model apps, you might wanna check out WhisperLive. It's this rad open-source project that's all about leveraging Whisper models for slick live transcription. Think real-time, on-the-fly translated captions for those global meetups. It's a neat example of practical, user-focused tech in action. Dive into the details on their GitHub page.
> It was humbling and also hype-busting to realize that it takes time to productize
Yep, looks like you found out why it’s taking so long to get this new tech into production. The gap between nothing and a proof of concept is, in some ways, much smaller than the gap between proof of concept and commercial product.
https://captioner.richardson.co.nz/
I would very much like to improve on this, but live translation/captioning still has some way to go in this space.
Source was here: https://github.com/Rodeoclash/captioner
Like, you had the time to train a bajillion-parameter model with a ton of attendant code, but an installation script was a bridge too far. I get that Python dependency management sucks, but you had to do it at least once yourself.
Of course, here I am reinstalling cuDNN for the umpteenth time because this software is provided free of charge and it sprinkles magical fairy dust on my GPU, so perhaps I shouldn't whine about it.
I was making a more general statement... I haven't even had time to personally look at any voice stuff...
Too many Shiny Things and too much ADHD in the Kool-Aid.
In particular, with the generation/recognition abilities of ML models, they have this feature of being a curiosity without being quite useful. So if a speech recognition program goes from 50% accuracy to 75% accuracy, it's a huge accomplishment, but the program is still approximately as useless as before. Going from 98% to 99% accuracy, on the other hand, also cuts the errors in half, but it's far more impressive: something that's already useful but makes mistakes now makes half as many. Once you hit the threshold of minimum usefulness, the exponential growth seems sudden and amazing when it's actually been going on for a long time.
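The arithmetic behind this is worth spelling out: halving the error rate is the same relative improvement at every accuracy level, but the absolute accuracy gain shrinks as you approach 100%.

```python
# Halving the error rate at two very different starting accuracies:
# the relative improvement is identical, the absolute gain is not.
for acc in (0.50, 0.98):
    err = 1 - acc          # current error rate
    halved = err / 2       # errors cut in half
    print(f"{acc:.0%} -> {1 - halved:.0%} accuracy "
          f"(errors {err:.0%} -> {halved:.0%})")
```

So 50% → 75% and 98% → 99% are the "same size" improvement in relative terms, yet only the second one crosses from usable-with-mistakes to noticeably better.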
At the same time, we've had a few great improvements in methodology for how models are designed (like transformers); the first iterations showed how impressive things could be but were full of inefficiencies, and we're watching those go away rather quickly.
For anyone who hasn't heard of it, this phrase is a reference to the theory of paradigm shifts in scientific progress, introduced in the book "The Structure of Scientific Revolutions" by Thomas Kuhn.
https://en.wikipedia.org/wiki/The_Structure_of_Scientific_Re...
I'm interested in applying TTS to a chat system, and one important feature for that is having as many distinct voices as possible, so that each person can have their own.
Would this, or something else be able to do that?
To check how this works in practice, you can look at the Google Colab link; at the end we clone a voice from a Churchill speech broadcast over the radio.
John Madden![1]
[1]: https://knowyourmeme.com/memes/moonbase-alpha-text-to-speech
We are constantly working on these models and we push new versions every two months or so. It should get even better soon. :)
garbage in, garbage out?
For plain old English TTS with a stock voice, there isn't that much of a difference (although Eleven Labs still wins IMO), but if you need either voice cloning or foreign language support, nothing else comes even close.
With that said, Eleven is extremely pricey; something like Azure TTS (which is the best among the cheap options) may be a better fit for less demanding applications.
If you're generating speech once and replaying it many times (e.g. making podcasts), the difference is negligible and you might as well go with Eleven Labs, since it's more customizable and possibly slightly higher quality. If you're doing interactive speech with customers, $9/hr is incredibly expensive (higher than hiring a minimum-wage worker in the U.S.!), and OpenAI's TTS is a very close second best and much more reasonably priced. If you're trying to integrate speech into an AI product, Eleven makes your hourly costs pretty unfeasible since you have to at minimum charge your customers more than it costs to hire a human being to do a task.
Azure's "Neural" line of TTS is the best of the big cloud offerings, but it's pretty mediocre compared to either OpenAI or Eleven Labs IMO. And it's actually more expensive than using OpenAI: it's $0.80 for 50,000 characters (~1hr), unless you're willing to commit to over $1k monthly spend, at which point it's barely cheaper than OpenAI at $0.64 per 50k characters.
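Putting the numbers from this thread side by side (assuming roughly 50k characters of text per hour of generated speech, as above; the OpenAI figure assumes tts-1 at $15 per 1M characters, its listed price at the time):

```python
# Rough per-hour TTS cost comparison using the figures in this thread.
CHARS_PER_HOUR = 50_000  # ~1 hour of speech, per the Azure estimate above

price_per_char = {
    "Eleven Labs":       9.00 / CHARS_PER_HOUR,   # ~$9/hr, per the thread
    "Azure Neural":      0.80 / CHARS_PER_HOUR,   # $0.80 per 50k chars
    "Azure (committed)": 0.64 / CHARS_PER_HOUR,   # with $1k+/mo commitment
    "OpenAI tts-1":     15.00 / 1_000_000,        # assumed $15 per 1M chars
}

for name, p in price_per_char.items():
    print(f"{name:18s} ${p * CHARS_PER_HOUR:5.2f}/hr")
```

On these assumptions Eleven Labs comes out more than 10x the price of any of the other options per hour of interactive speech, which is the gap the comments above are describing.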
OpenAI's TTS is IMO the best option for anything interactive, since it's much higher quality than Azure's Neural TTS and much cheaper than Eleven Labs (with very little quality difference).
I think it should work pretty well with Apple's MLX framework as well, if anyone is willing to convert it. :)