The key here is that the Whisper multilingual ASR model has been trained on a huge amount of data, so its encoder output is a very good representation of the semantic content of speech. This can be used as an open-source, drop-in replacement for the semantic encoder in model architectures like SPEAR-TTS/VALL-E/etc. (whose semantic encoders are not publicly available). The semantic tokens are then used to predict acoustic tokens (the output of the quantized/low-bandwidth Encodec audio codec), which are then upsampled/denoised/enhanced with the Vocos vocoder.
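A conceptual sketch of the data flow described above. All function names and token shapes here are placeholders, not the real WhisperSpeech/Encodec/Vocos APIs; the point is only what flows between the three stages.

```python
# Conceptual sketch of the pipeline: text -> semantic tokens ->
# acoustic tokens -> waveform. Every stage is a stand-in stub.

def whisper_semantic_tokens(text):
    """Stand-in for Whisper-encoder-derived semantic tokens (fake ids)."""
    return [hash(w) % 1024 for w in text.split()]

def predict_acoustic_tokens(semantic):
    """Stand-in for the semantic->acoustic model (cf. SPEAR-TTS/VALL-E).
    Acoustic tokens run at a higher rate, so emit several per input token."""
    return [(t * 3) % 1024 for t in semantic for _ in range(4)]

def vocode(acoustic):
    """Stand-in for Encodec decoding plus Vocos enhancement -> samples."""
    return [float(t) / 1024 for t in acoustic]

audio = vocode(predict_acoustic_tokens(whisper_semantic_tokens("hello world")))
print(len(audio))  # 8 "samples" for 2 words (4 acoustic tokens per word)
```

The real models obviously replace each stub with a neural network, but the interfaces between the stages look like this: token ids in, token ids out, waveform at the end.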
I know someone is working on Hindi, but it would be great to see this extended to other languages for a properly open-source [1], multilingual TTS platform. I think the main bottleneck at the moment is finding people who can procure/clean compliant datasets.
[0] https://github.com/jpc

[1] jpc/Collabora went to great efforts to ensure that they are only using properly licensed data to train this. I doubt Whisper itself was that compliant, so it's a bit muddy.
And the part of that which we use for WhisperSpeech is just the phonetic representation, so our model is not able to recreate any of the Whisper training data in any way.
Is that less certain than the quote implies?
Thanks for all the nice comments! I've been working really hard on this model for quite a few months now, but there are still a lot of ways we can make it better.
Thanks to the generosity of Collabora, this is a real open-source project (not just a one-time marketing ploy), so if you want to help improve it or integrate it into something you're building, I'd love to help.
You can also buy our undivided engineering attention if you have a business use-case. :)
We're probably interested! We're Overte, an open-source VR/desktop social platform.
The system targets VR and voice chat primarily, but we want to be more accessible to people who can't use voice chat for any reason. We do have an integrated chat, but it's not an ideal experience in VR. So good TTS to make it integrate better would be great for us. And the possibility of doing this without some sort of commercial API that requires keeping a secret API key is huge.
So yeah, we're very much interested in giving this one a try. It will probably take some time as we're gearing up for FOSDEM now, though.
I'm developing a dynamic tutorial desktop app and I plan to use this model as a text-to-speech synthesizer. Any chance it could be ported to ONNX format?
[0] https://github.com/netease-youdao/EmotiVoice
[1] https://github.com/siraben/emotivoice-cli
[2] https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Clon...
What we seem to need are high-quality speech recordings in any language (audiobooks are great), plus some recordings for each target language which can be low-quality but need varied prosody/emotions (otherwise everything we generate will sound like an audiobook).
Not sure whether that's enough data for you. (If you need paired text for the LibriVox audiobooks, I can provide you with versions where I "fixed" the original text to match the audiobook content e.g. when someone skipped a line.)
Last year Piper development was supported by Nabu Casa for their "Year of Voice" project for Home Assistant and it sounds like Mike Hansen is going to continue on it with their support this year.
That said, if you have a modern Nvidia GPU you should be able to run a voice-bot in real time with WhisperSpeech.
That approach would be useful for things like shifting a voice to a different accent and to support voices speaking multiple languages.
This can be done to a limited extent for models such as MBROLA voices by mapping the phonemes of one language to the phonemes of the MBROLA voice. MBROLA is more complex in that it supports diphones, and many diphone pairs don't exist, so you need to map 3 phonemes together to get the best-matching phonetic transcription.
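A toy illustration of that mapping trick (the inventory and bridge table below are made up, not a real MBROLA database): when a diphone pair is missing from the voice's inventory, route through an intermediate phoneme so both resulting pairs exist, i.e. the "map 3 phonemes together" case.

```python
# Hypothetical diphone inventory for a voice; real MBROLA voices ship
# their own diphone databases.
INVENTORY = {("h", "e"), ("e", "l"), ("l", "o"), ("o", "_"), ("e", "o")}

# When a diphone is missing, bridge through an intermediate phoneme x so
# that (a, x) and (x, b) both exist in the inventory.
BRIDGES = {("h", "o"): "e"}  # hypothetical mapping

def to_diphones(phonemes):
    out = []
    for a, b in zip(phonemes, phonemes[1:]):
        if (a, b) in INVENTORY:
            out.append((a, b))
        elif (a, b) in BRIDGES:
            x = BRIDGES[(a, b)]
            out.extend([(a, x), (x, b)])
        else:
            raise KeyError(f"no mapping for diphone {(a, b)}")
    return out

print(to_diphones(["h", "o"]))  # bridged: [('h', 'e'), ('e', 'o')]
```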
The IPA approach may also make the phonetic synthesis easier to train, given that the IPA vowels lie on a formant continuum (similar to colour wheels and cubes). The model could then better learn the variations in voice quality and timbre.
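To make the continuum idea concrete: vowels can be placed in a continuous F1/F2 formant space, so "which vowel is this?" becomes a nearest-neighbour query. The formant values below are rough textbook approximations, not measurements.

```python
import math

# Rough, textbook-approximate F1/F2 formant frequencies in Hz.
# The exact numbers vary by speaker; the point is the continuous space.
VOWELS = {
    "i": (280, 2250),  # close front
    "e": (400, 2000),  # mid front
    "u": (310, 870),   # close back
    "a": (700, 1200),  # open
}

def nearest_vowel(f1, f2):
    """Return the vowel symbol closest to a measured (F1, F2) point."""
    return min(VOWELS, key=lambda v: math.dist((f1, f2), VOWELS[v]))

print(nearest_vowel(650, 1250))  # a measurement near /a/ -> 'a'
```

Because the space is continuous, a model trained on it can interpolate between vowel qualities rather than treating each symbol as an unrelated category.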
There is still a lot to explore in this space – we certainly don't have all the answers yet!
By using the Whisper-derived phonetic representation (so-called semantic tokens), we successfully trained a model with just a high-quality speech dataset in one language, and the voice quality transferred to English.
Am I weird in just having my head spin? I've been at the leading edge of tech before, but this feels like me yelling at these new algos from my lawn.
Whisper and self-hostable LLMs had a Cambrian explosion about a year ago. I attended a GPT-4 hackathon last March and in 48 hours saw people hook up Speech2Text -> LLM -> Text2Speech pipelines for their live demos. I thought we would all have a babelfish by June.
Months later I attended some conferences with international speakers who really wanted live, translated-on-the-fly captions, but there wasn't anything off the shelf they could use. I found a helpful repo for using Whisper with rolling transcription, but struggled to get the Python prerequisites installed (it involved hardlinking to a TensorFlow repo for my particular version of M1 CPU). It was humbling and also hype-busting to realize that it takes time to productize, and that LLMs are not magic that can write these applications themselves.
In the meantime, even Google hasn't bothered to run the improved transcription models on YouTube videos. They're still the old, roughly 80%-accurate tech that's useless for anyone with an accent.
I agree. I was thinking about making a Jarvis-like bot, which should be pretty easy at this point. The main problem was that my iPhone doesn't easily allow pressing a button upon which it starts listening; you always need to unlock first, at which point the whole screen gets unlocked too. Maybe these kinds of GUI-focused interfaces are blocking a lot of ideas? At the same time, it's great that people will come up with new devices, and these will compete somewhat with phones.
If you're grappling with the slow march from cool tech demos to real-world language model apps, you might wanna check out WhisperLive. It's this rad open-source project that's all about leveraging Whisper models for slick live transcription. Think real-time, on-the-fly translated captions for those global meetups. It's a neat example of practical, user-focused tech in action. Dive into the details on their GitHub page.
> It was humbling and also hype-busting to realize that it takes time to productize
Yep, looks like you found out why it’s taking so long to get this new tech into production. The gap between nothing and a proof of concept is, in some ways, much smaller than the gap between proof of concept and commercial product.
https://captioner.richardson.co.nz/
I would very much like to improve on this, but live translation/captioning still has some way to go in this space.
Source was here: https://github.com/Rodeoclash/captioner
Like, you had the time to train a bajillion-parameter model with a ton of attendant code, but an installation script was a bridge too far. I get that Python dependency management sucks, but you had to do it at least once yourself.
Of course, here I am reinstalling cuDNN for the umpteenth time because this software is provided free of charge and it sprinkles magical fairy dust on my GPU, so perhaps I shouldn't whine about it.
I was making a more general statement... I haven't even had time to personally look at any voice stuff...
Too many Shiny Things and too much ADHD in the Kool-Aid.
In particular, with the generation/recognition abilities of ML models, they have this feature of being a curiosity without being quite useful. So if a speech recognition program goes from 50% accuracy to 75% accuracy, it's a huge accomplishment, but the program is still approximately as useless as before. Going from 98% to 99% accuracy, on the other hand, also cuts the errors in half, but it's far more impressive: something that's already useful but makes mistakes now makes half as many. Once you hit the threshold of minimum usefulness, the exponential growth seems sudden and amazing when it's actually been going on for a long time.
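The arithmetic behind this is worth spelling out: halving the error rate is the same relative improvement at every accuracy level, but the absolute accuracy gain shrinks as you approach 100%.

```python
# Halving the error rate at two very different starting accuracies:
# the relative improvement is identical, the absolute gain is not.
for acc in (0.50, 0.98):
    err = 1 - acc          # current error rate
    halved = err / 2       # errors cut in half
    print(f"{acc:.0%} -> {1 - halved:.0%} accuracy "
          f"(errors {err:.0%} -> {halved:.0%})")
```

So 50% → 75% and 98% → 99% are the "same size" improvement in relative terms, yet only the second one crosses from usable-with-mistakes to noticeably better.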
At the same time, we've had a few great improvements in methodology for how models are designed (like transformers); the first iterations showed how impressive things could be but were full of inefficiencies, and we're watching those go away rather quickly.
For anyone who hasn't heard of it, this phrase is a reference to the theory of paradigm shifts in scientific progress, introduced in the book "The Structure of Scientific Revolutions" by Thomas Kuhn.
https://en.wikipedia.org/wiki/The_Structure_of_Scientific_Re...
I'm interested in applying TTS to a chat system, and one important feature for that is having as many distinct voices as possible, so that each person can have their own.
Would this, or something else be able to do that?
To check how this works in practice, you can look at the Google Colab link; at the end we clone a voice from a Churchill speech broadcast over the radio.
John Madden![1]
[1]: https://knowyourmeme.com/memes/moonbase-alpha-text-to-speech
We are constantly working on these models and we push new versions every two months or so. It should get even better soon. :)
garbage in, garbage out?
For plain old English TTS with a stock voice, there isn't that much of a difference (although Eleven Labs still wins IMO), but if you need either voice cloning or foreign language support, nothing else comes even close.
With that said, Eleven is extremely pricey; something like Azure TTS (which is the best among the cheap options) may be a better fit for less demanding applications.
If you're generating speech once and replaying it many times (e.g. making podcasts), the difference is negligible and you might as well go with Eleven Labs, since it's more customizable and possibly slightly higher quality. If you're doing interactive speech with customers, $9/hr is incredibly expensive (higher than hiring a minimum-wage worker in the U.S.!), and OpenAI's TTS is a very close second best and much more reasonably priced. If you're trying to integrate speech into an AI product, Eleven makes your hourly costs pretty unfeasible since you have to at minimum charge your customers more than it costs to hire a human being to do a task.
Azure's "Neural" line of TTS is the best of the big cloud offerings, but it's pretty mediocre compared to either OpenAI or Eleven Labs IMO. And it's actually more expensive than using OpenAI: it's $0.80 for 50,000 characters (~1hr), unless you're willing to commit to over $1k monthly spend, at which point it's barely cheaper than OpenAI at $0.64 per 50k characters.
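Putting the numbers from this thread side by side (assuming roughly 50k characters of text per hour of generated speech, as above; the OpenAI figure assumes tts-1 at $15 per 1M characters, its listed price at the time):

```python
# Rough per-hour TTS cost comparison using the figures in this thread.
CHARS_PER_HOUR = 50_000  # ~1 hour of speech, per the Azure estimate above

price_per_char = {
    "Eleven Labs":       9.00 / CHARS_PER_HOUR,   # ~$9/hr, per the thread
    "Azure Neural":      0.80 / CHARS_PER_HOUR,   # $0.80 per 50k chars
    "Azure (committed)": 0.64 / CHARS_PER_HOUR,   # with $1k+/mo commitment
    "OpenAI tts-1":     15.00 / 1_000_000,        # assumed $15 per 1M chars
}

for name, p in price_per_char.items():
    print(f"{name:18s} ${p * CHARS_PER_HOUR:5.2f}/hr")
```

On these assumptions Eleven Labs comes out more than 10x the price of any of the other options per hour of interactive speech, which is the gap the comments above are describing.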
OpenAI's TTS is IMO the best option for anything interactive, since it's much higher quality than Azure's Neural TTS and much cheaper than Eleven Labs (with very little quality difference).
I think it should work pretty well with Apple's MLX framework as well, if anyone is willing to convert it. :)