My ideal would be an open source "deepfake toolkit" that allows me to provide pre-recorded samples of her speech and then TTS in her voice. Unfortunately most articles and tools I'm finding are anti-deepfake. Any recommendations?
Fallback would be recording her speaking "phonetic pangrams" and then using her pre-recorded phonemes to recreate speech that sounds like her. I feel like the deepfake toolkit is the way to go. Appreciate any recommendations... There must be open source tools for this??
If you want to read up on the basics, check out the SV2TTS paper: https://arxiv.org/pdf/1806.04558.pdf Basically you use a speaker encoding to condition the TTS output. This paper/idea is used all over, even for speech-to-speech translation, with small changes.
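To make "condition the TTS output on a speaker encoding" a bit more concrete, here's a toy sketch in plain Python. As I understand the paper, the speaker embedding is concatenated onto the synthesizer's encoder output at every time step. The tiny vectors and the function names below are made up for illustration; a real system would use learned networks and far larger dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def condition_on_speaker(encoder_frames, speaker_embedding):
    """Append the same normalized speaker embedding to every text-encoder frame."""
    emb = l2_normalize(speaker_embedding)
    return [list(frame) + emb for frame in encoder_frames]


# Hypothetical values: three encoder time steps with 2 features each,
# plus a 2-dimensional "voice fingerprint" for the target speaker.
text_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
speaker = [3.0, 4.0]

conditioned = condition_on_speaker(text_frames, speaker)
print(len(conditioned), "frames of width", len(conditioned[0]))
```

The decoder then sees both "what to say" and "who is saying it" in every frame, which is why swapping the embedding swaps the voice.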
There are a few open-source implementations, but most are outdated--the better ones are private, for either business or privacy reasons.
There's a lot of work on non-parallel transfer learning (aka subjects are saying different things) so TTS has progressed rapidly and most public implementations lag a bit behind the research. If you're willing to grok speech processing, I'd start with NeMo for overall simplicity--don't get distracted by Kaldi.
Edit: Important note! Utterances are usually clipped of silence before/after so take that into account when analyzing corpus lengths. The quality of each utterance is much much more important than the length--fifteen.ai's TTS is so good primarily because they got fans of each character to collect the data.
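To illustrate the silence-clipping point, here's a minimal sketch that trims leading and trailing silence with a simple amplitude threshold and compares durations before and after. The threshold, the toy sample rate, and the sample values are all assumptions for the example; real pipelines typically use an energy-based or VAD-based trimmer.

```python
def trim_silence(samples, threshold=0.02):
    """Drop leading/trailing samples whose absolute value is below threshold.

    Quiet samples in the middle of the utterance are kept.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return samples[loud[0]:loud[-1] + 1]


def duration_seconds(samples, sample_rate):
    """Length of a mono sample list in seconds."""
    return len(samples) / sample_rate


# Toy utterance at a made-up 10 Hz sample rate: silence, speech, silence.
sample_rate = 10
utterance = [0.0] * 5 + [0.5, -0.4, 0.3, 0.0, 0.6] + [0.0] * 10

trimmed = trim_silence(utterance)
raw_seconds = duration_seconds(utterance, sample_rate)
trimmed_seconds = duration_seconds(trimmed, sample_rate)
print(f"raw: {raw_seconds}s, after trimming: {trimmed_seconds}s")
```

In this toy case three quarters of the "recording" is silence, so an hour of raw audio can be a lot less than an hour of usable speech.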
But obviously also attend to the human matters as well, eg spend time.
On the upside, your father can choose any celebrity he wants to voice him! Tons of celeb data is publicly available (VoxCeleb 1 & 2).
I've been wanting to create a TTS of myself so I can take phone calls using headphones and type back what I want to say so that I don't have to yell private information out loud in public locations. Would be nice if during non-COVID times I could sit in a train seat and take phone calls completely silently.
You can set it up yourself with a bit of Python knowledge from this branch: https://github.com/talonvoice/noise/tree/speech-dataset
There are keyboard shortcuts - up/down/space to move through the list and record quickly.
If you want to use it on arbitrary text prompts, you can modify this function to return each line from a text file: https://github.com/talonvoice/noise/blob/speech-dataset/serv...
If you use this, before recording too much, do some test recordings and make sure they sound ok. Web audio can be unreliable in some browsers.
The uploaded files are named after the short name, so make sure you can correspond the short name with the original text prompts, eg with string_to_shortname().
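One way to keep that correspondence is to build the short name to prompt mapping up front and save it next to the recordings. This is only a sketch: it assumes a plain text file with one prompt per line, and `placeholder_shortname` is a made-up stand-in, not the repo's `string_to_shortname()`. Pass in the real function when you use it.

```python
import json
from pathlib import Path


def load_prompts(path):
    """Read one prompt per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_index(prompts, to_shortname):
    """Map each short name back to its prompt, refusing ambiguous names."""
    index = {}
    for prompt in prompts:
        name = to_shortname(prompt)
        if name in index and index[name] != prompt:
            raise ValueError(f"shortname collision: {name!r}")
        index[name] = prompt
    return index


def placeholder_shortname(text):
    """Hypothetical stand-in: first four words, lowercased, joined by '_'."""
    words = "".join(c.lower() if c.isalnum() else " " for c in text).split()
    return "_".join(words[:4])


demo_prompts = ["The quick brown fox jumps.", "Hello there!"]
index = build_index(demo_prompts, placeholder_shortname)
print(json.dumps(index, indent=2))
```

The collision check matters because two prompts that shorten to the same name would otherwise leave you unable to tell which text a recording belongs to.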
If you aren’t easily able to do this yourself, I’d be happy to spin up an instance of it for you with text prompts of your choosing.
Also, I noted the VLC demo says it doesn't use DNS! That's awesome...
Is that something that would be useful to a researcher in any context? I am intrigued by the idea of having my voice preserved (you know, ego), but also am happy to donate the sound files if they would help researchers in any way for datasets.
If so: chris@theamphour.com
In general, yes, this is probably useful data in some way for speech recognition or TTS.
I would say also consider recording a variety of honest utterances of all kinds, situations, and emotions. Anger outbursts, apathetic grunts, sexual even if you so desire (hence throwaway account)... Please don't be offended by this, just thinking of all scenarios for you to decide for yourself...
https://github.com/daanzu/speech-training-recorder
Originally intended for recording data for training speech recognition models [0], it should work just as well for recording to be used for speech synthesis.
I cannot find it now, but I believe he wrote about this exact phenomenon: even with the best technology, you cannot communicate as fluently as a conversation demands, so you're relegated to the background.
Here's one of his writings I was able to find: https://www.rogerebert.com/roger-ebert/i-think-im-musing-my-...
That way, you can retrain an existing AI to do text to speech with her own voice.
Edit: here's a link to the corpus that I believe Mozilla uses http://www.openslr.org/12/
It might make sense to consider making a recording that is more meaningful, and focus on giving her emotional support rather than building an AI that could be perceived as a replacement.
I believe some speakers only recorded 1-2 hours, which seems doable.
I think OP would ideally want the model to pick up on more natural intonation, instead of monotone dictation. Record everything from now on, as best you can with similar recording context, and hopefully that data will be enough to cover more natural nuances.
'This AI Clones Your Voice After Listening for 5 Seconds'
The downvoted commenter was being a jerk, but I do think learning ASL is an option worth looking into.
If you've ever had a mouth injury that inhibits talking, or been in a foreign environment where your speech is totally useless, it can be very stressful to be unable to communicate. I think the couple should consider learning some of the basics ahead of time, so that communication is possible without typing or any other apparatus.
Considering post-surgery recovery window, I'd want to be able to express very basic things like:
I am comfortable
I am in pain
I am hungry
I am nauseated
I need to urinate/defecate
I want to rest
I love you
When will you return
etc. I might suggest trying to boil down one or two inside-joke kinds of phrases as well, to be able to lift each other's spirits in a private or intimate way.
Also, if you’re not in America, you can learn your local sign language (e.g. British Sign Language, AusLan)
Obviously, it comes with great effort on both the part of the wife and OP, plus a rethinking of some social interactions and even social groups.
However, no problem is insurmountable with sufficient assistance and support from friends, family, and expert groups. Learning sign language is fun and a great way to meet new friends, hearing and Deaf alike.
It may be a last resort, but it's an option not to be ignored.
Outside of social situations, it honestly hasn't been that big of a deal for me. As a remote developer, my job has remained the same. My managers and coworkers have been super supportive. I send messages during meetings to one person who will read them aloud for me.
With text and social media, I still keep up with friends and family. Most medical appointments, etc, can be made online. SprintIP relay is free for deaf/speech impaired, and it allows the caller to type what they want to say and a representative will relay this to the other party. It works via the web or a mobile app. https://www.sprintrelay.com/sprintiprelay
Banks, brokers, or anything involving personal info (like SS#) usually requires a voice phone call. I have my wife call and explain the situation. I can whisper yes, as they occasionally require me to give permission. Some call center representatives have no idea how to handle this situation, and will just stick to the script saying they have to speak to me the entire time. My wife just thanks them, calls back, and hopes for someone more understanding.
There are awkward encounters where people don't know you can't speak, and will respond by speaking louder and slower. These people will also assume you are not intelligent and be dismissive. This is just one of the things you have to deal with.
I sincerely hope the procedure goes well and your wife doesn't have to deal with this. Just know that even if the worst happens, she can have a normal and productive life!
It sucks you have to just deal with it.
Did you ever consider learning sign language?
We have a lot of tapes around of his voice, from voice mails to family videos to some things from his work. If you are open to reaching out that would be awesome, I’ll check out the site as well.
Edit: I’ve wanted to make some sort of soundboard + “text to talk” setup for this family member. He often can’t participate in conversations because he writes on a whiteboard, and the speed of chatter moves faster than his writing
We also have an API that you might find useful for the soundboard project: https://app.resemble.ai/docs
Out of interest what are the average response times to generate a clip of one or two sentences from a configured voice?
Imagining the easy text-to-speech solution the OP could build on this resemble API.
Not only will you have your own personal "audio books" of Harry Potter/The Hobbit/Chronicles of Narnia/Oi Frog/Alice in Wonderland/Roald Dahls etc etc for any kids/grandkids/relatives etc that will hopefully be something treasured in its own right, but you'll also have a large corpus of training data from well-known texts that you can retrain over and over as the tech improves in the future. Might be worth chucking in some other well-known texts to avoid over-fitting on a "kids' story voice" - maybe something plain like inauguration speeches/declaration of independence/magna carta/etc.
Obviously I'd focus on gathering raw material now, and focus on the reconstruction later when you've all recovered mentally and physically from whatever happens. The more data the better when it comes to this sort of thing. There might not be something "simple" right now (e.g. you could probably implement the WaveNet or similar paper yourself today, and train it up on some GPUs in your spare room etc, but in a few years there might be a nice WYSIWYG/SaaS thing for it), but with the recordings safely stored you'll obviously be able to use them in the future.
Best of luck to you both.
I might be wrong though.
We cannot rule out she wants to spend quality time with her partner instead of spending time in a recording studio, so that, if the worst outcome comes, her husband can remind her of what she lost.
The sentiment is admirable, but it's a lot of work considering that the probability of a negative outcome is very low.
I'm not sure there's a correlation to other senses, I can't see for my future self or move on his behalf. I suppose there are things I would want to taste or smell if I was going to lose those senses, but those are experiences for me, not things I'd use to communicate with loved ones.
After losing my voice in an accident, I'd be willing to spend many, many hours transcribing my own speech in the handful of scratchy family videos, voicemails, and phone logs of ordinary conversations. If I could spend a couple days prior to the event reading some books, a TTS training corpus, or anniversary/birthday/wedding/etc greetings and congratulations into a microphone and have a personal text-to-speech voice I'd be all over that.
It would be a little weird if someone else used it as their narrator, but that's not OP's goal.
Speaking of recording books and training corpuses, my grandparents (who have their voices) got a special kind of joy from reading children's books that they once read to me and that I once heard as a child to their new grandson. OP, if you and your wife have or might have kids (and she can handle it emotionally), it might be nice to record video/audio of reading children's books to future grandchildren. Even if your future grandchild knows that grandma can't read books out loud, I'd bet Grandma would be happy to silently turn the pages for a toddler on her lap until those digital recordings got worn and scratchy like an old VHS.
This is less of a problem with modern high-quality mics than it was, say, with answering machines 30 years ago. Your voice might still sound not exactly the same, but it hopefully shouldn't be unbearably grating either.
Recording audio and then choosing not to use it later is fine.
Not recording it because I don't want it right now... maybe fine? maybe sad.
Having spent a good deal of time in hospitals, a few things I recommend... 10’ phone cable since outlets can sometimes be far from the bed, cheap slippers she can wear to walk around (stepping in a hospital hallway mystery puddle wearing just socks is very unpleasant), comfy clothes that you don’t mind having ruined (T-shirts, underwear, shirts, pajama pants - they can temporarily unhook the IV so she can put a T-shirt on), earplugs, eye mask. If she’s going to be on liquid-only diet, bring your own since hospital food is not great, not terrible. Soylent/Orgain/Ensure if she’s permitted that, otherwise good quality Italian ices are such a nice treat and most hospitals have a patient fridge/freezer you can store them in. Broth, but go to a restaurant or grocery store/farmers market with hot soup bar and fill a container with just the broth from the chicken noodle soup. It’s INFINITELY better than boxed broth.
Hopefully all of your research and preparation will be for nothing, I wish you and your wife a successful surgery!
I am going to assume that your wife and you have a healthy relationship with strong communication, in part because you've developed an intuition for her body language and other non-verbal communication methods. In the scenario where she loses her ability to speak, even if she happily and completely takes to whatever technical solution(s) you offer to replace that, I think it's likely she will reflexively lean more heavily on those non-verbal channels, and you're going to need to get better at reading them than you are now.
https://speech.microsoft.com/customvoice
I imagine if MS offers custom voices then the other text to speech providers do as well.
Good luck
> We evaluated our Kennedy results qualitatively along the following dimensions: ... naturalness of the composited articulation; ...
Obviously the state of the art will have advanced, but maybe this can point the way toward more current research.
While I tend to agree with everyone else that this can be a great idea, my instinct is to float the idea to your wife first and see how she responds. I can imagine someone taking this negatively.
https://www.youtube.com/channel/UCID5qusrF32kSj-oSGq3rJg/vid...
Just in case: record specific messages for various people in her life that can be used repeatedly (children, Mom, Dad, siblings, in-laws, friends). Messages like: "X, I love you", "X, I miss you.", "Mommy loves you!", "Give me a hug", a holiday greeting, "Happy Birthday", "I'm so proud of you!", a favorite happy saying, a frustration saying.
You get the idea.
Recording a message to a yet unborn grandchild is maybe something we could all do!
We also used the Verbally premium iPad app to help give him a voice and make transactions easier.
Wishing you all the best.
The paper https://arxiv.org/abs/1904.05441 has a list of spoofing methods.
Here's one method as a paper: https://arxiv.org/pdf/1806.04558.pdf
And here on GitHub https://github.com/CorentinJ/Real-Time-Voice-Cloning
It’s a bit dated at this point, but I imagine the research has vastly improved since then.
It’s a very good question though. A decade ago this was able to be done for one man. Is it now possible to be done for anyone? Like others, I’d guess the first step is to record everything while you can.
You ideally want five hours of clean speech (good microphone, no background noise, high sample rate). It should be spoken clearly, in a single tone or mood. My model sounds awful because the data isn't consistent, and the room tone and microphones are terrible.
If you want different prosody or moods, don't mix them in the same data set.
You can experiment with transfer learning LJSpeech with Nvidia Tacotron2 right now. Glow-tts is also promising.
You'll start to get results with fifteen minutes of sample data, but for high quality you want a lot of audio.
Have your wife read a book and record it. The training chunks will be ~10 seconds apiece, so keep that in mind for how to segment the audio.
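For the segmenting part, here's a minimal sketch that greedily packs consecutive sentences into chunks of at most about 10 seconds. It assumes you already know each sentence's duration (for example from cutting at pauses); the durations below are invented, and the 10 second limit is just the rough figure mentioned above.

```python
def group_into_chunks(durations, max_seconds=10.0):
    """Greedily group consecutive sentence indices so each chunk stays under the limit.

    A single sentence longer than the limit ends up alone in its own chunk.
    """
    chunks, current, total = [], [], 0.0
    for index, seconds in enumerate(durations):
        if current and total + seconds > max_seconds:
            chunks.append(current)
            current, total = [], 0.0
        current.append(index)
        total += seconds
    if current:
        chunks.append(current)
    return chunks


# Hypothetical per-sentence durations in seconds, in reading order.
sentence_seconds = [4.0, 3.5, 5.0, 2.0, 9.0, 1.5]
chunks = group_into_chunks(sentence_seconds)
print(chunks)
```

Cutting only at sentence boundaries keeps each clip aligned with a clean piece of transcript, which is what the training tools expect.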
Focus on getting lots of good sounding data. Hours. The models will improve, but this may be your only shot of acquiring the data.
Download the LJSpeech dataset and listen to it. See how it sounds, how it's separated. That is a fantastic dataset that has yielded tremendous results, and you can use it for inspiration.
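If you want your own recordings laid out the same way, here's a small sketch that writes transcript lines in the LJSpeech style. As far as I recall, LJSpeech ships a `metadata.csv` with three pipe-separated fields per clip (clip ID, transcription, normalized transcription) alongside a `wavs/` folder; double-check against the dataset's README. The clip IDs and sentences below are made up.

```python
def metadata_line(clip_id, text, normalized=None):
    """Format one pipe-separated transcript line: id|text|normalized text."""
    normalized = text if normalized is None else normalized
    for field in (clip_id, text, normalized):
        if "|" in field or "\n" in field:
            raise ValueError("fields must not contain '|' or newlines")
    return f"{clip_id}|{text}|{normalized}"


# Hypothetical clips: (file stem, what was read, spelled-out version).
clips = [
    ("voice-0001", "Dr. Smith arrived at 9.", "Doctor Smith arrived at nine."),
    ("voice-0002", "Good morning.", None),
]

lines = [metadata_line(*clip) for clip in clips]
print("\n".join(lines))
```

The normalized column is where numbers and abbreviations get spelled out, so the model learns from the words that were actually spoken.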
Get a decent audio headset, have it record the audio to her phone, and spend hours talking to her about whatever. Preferably in a reasonably quiet environment.
Just spend a lot of time talking. You don't have to talk to her through a headset. Just make sure hers is recording her voice.
It would be easy, painless, and probably good for the relationship too.
Make sure the recordings are of a good quality. This will ensure that you will have a baseline TTS of her voice at the minimum.
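A quick sanity check you can run on each file, sketched with only the Python standard library. It reports sample rate, channel count, length, and whether the peak hits the 16-bit ceiling (a sign of clipping). It assumes uncompressed 16-bit PCM WAV files; the test tone is generated only so the example runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def inspect_wav(path, clip_level=32767):
    """Return basic quality facts about a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        rate = wav.getframerate()
        frames = wav.getnframes()
        raw = wav.readframes(frames)
    if width != 2:
        raise ValueError("this sketch only handles 16-bit PCM")
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    peak = max((abs(s) for s in samples), default=0)
    return {
        "sample_rate": rate,
        "channels": channels,
        "seconds": frames / rate,
        "peak": peak,
        "clipped": peak >= clip_level,
    }


def write_test_tone(path, rate=22050, seconds=0.5):
    """Write a quiet 220 Hz sine wave so the demo has something to inspect."""
    n = int(rate * seconds)
    samples = [int(12000 * math.sin(2 * math.pi * 220 * i / rate)) for i in range(n)]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack(f"<{n}h", *samples))


demo_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_test_tone(demo_path)
report = inspect_wav(demo_path)
print(report)
```

Running something like this after the first few takes catches a bad microphone setting before hours of recording are wasted.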
Since ALS (aka Lou Gehrig's disease) is a degenerative motor neuron disease, people with ALS can pretty much count on eventually losing the ability to speak. So "voice banking" is apparently pretty common.
https://play.google.com/store/apps/details?id=org.anaisbetts...
This is a text-to-speech app with a very keen emphasis on Day To Day usage - the UX will put the focus at the right places, help you reply faster, etc. I used it for a full month when I was unable to speak after voice surgery and it made a big difference, other folx have reported the same
Do you and your wife drink alcohol a bit? If so might it be worth having a couple of drinks in a quiet setting with her one evening with microphones running? I'm not suggesting getting wasted! I'm just wondering whether it might help to catch her getting more animated or "natural" in conversation. I was thinking this might help make the resulting synthesized speech capture even more of her personality than reading children's books or subsets of AI corpora etc.
The voice cloning can be done in a matter of minutes (under an hour). It's also very easy to use the website.
Best of luck!
1. I spend a lot of time online. It doesn't matter so much there. I do a lot of typing.
2. My oldest son, who had serious output difficulties as a child, is talented at inferring what I need from a gesture and a grunt. This has proven enormously helpful.
3. Consider using her phone as a communication device. It's small and people tend to take their phone everywhere and she can type out what she wants to say.
4. Writing tweets can help a person learn to say things more succinctly. I do freelance writing and figuring out how to say things succinctly is a talent you can develop. (It's something I have to work at -- I'm a "would have written you a shorter letter if I had more time" type of person.) This can help enormously when you face communication barriers.
5. Take some time to deal with the emotional stuff. It matters.
I'm sorry you are facing this. Best of luck.
Also, the model is not saved in the browser with Colab, so you might also want to do it locally to save it eventually (if it comes to that).
All the best mate!
[0] Main repo: https://github.com/CorentinJ/Real-Time-Voice-Cloning [1] Google colab repo to try it out: https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/ma...
https://www.tobiidynavox.com/en-gb/software/web-applications...
https://mycroft.ai/blog/mimic-2-is-live/ https://github.com/MycroftAI/mimic2
Festival Speech Synthesis has a tool for recording speech databases, and some tutorials for training festival voices. http://www.cstr.ed.ac.uk/research/projects/speechrecorder/
What you need to do is spend the entire next 3 weeks doing voice banking. This will give your wife a text-to-speech voice (SAPI 5 voice, or others, for example). You record phrases that the voice banking service wants you to speak, with a high quality headset (best if wired) in a quiet setting.
The more sentences (samples) you have, the better the voice will be, obviously. But, there are services out there that will update the recordings, as the technology gets better, and that is the way to go, in terms of choosing the "best service".
The voice banking services that people typically use are here: https://www.mndassociation.org/professionals/management-of-m...
I would say that Acapela my-own-voice is currently the best technology. Obviously there are open source technologies, but you do not have the luxury of time to figure all of that out. However, you should do your own voice banking for later post-processing on your own with open source stuff.
There is also a free version of voice banking available, but I would only recommend it as a secondary tool: https://www.modeltalker.org/
This app (iOS and Android) for example, allows you to use your personal voice banked text-to-speech voice, to talk: https://therapy-box.co.uk/predictable
This is another great app that allows you to use your personal voice banked text-to-speech voice: https://www.assistiveware.com/products/proloquo4text
Source: Disabled engineering student, who is extremely interested in assistive technology. I would love to be a rehabilitation engineer.
Best of luck to the two of you. I really hope you don't ever need this technology.
[1] https://dam-prod.media.mit.edu/x/2018/03/23/p43-kapur_BRjFwE...
It's been mentioned a bit already, but thought it was worth calling out. This may be one of the lowest-overhead ways to start experimenting, at least in terms of data collection.
I have to say I didn't help as much as I thought I could and afterwards I was always wondering if I could have used this technology or that and done more.
So - I think you should recognize that you can only do so much, we're doing the best we can, and in the end we are all winging it.
There's also open source TTS from Mozilla: https://github.com/mozilla/TTS
https://www.descript.com/lyrebird-ai
I hope good folks in there will help you, try reaching them.
deepfake for voice: https://github.com/CorentinJ/Real-Time-Voice-Cloning
Reproducing emotional voices: https://www.sonantic.io/
[1] https://www.ted.com/talks/rupal_patel_synthetic_voices_as_un...
https://www.cnn.com/2018/06/15/health/dystonia-jamie-dupree-...
He uses a text-to-speech system that sounds more-or-less like him.
You said there is a small chance, so I really wish you and your wife the best of luck that she and her voice will be fine after the surgery.
Maybe also if she has a favourite book or a favourite quote, get those recorded too.
Back it all up!
Beyond the technical answer, you may want her to record some nice personal words addressed to your family that you can listen to later.
You don't need to do anything until the worst case materialises.
https://github.com/daanzu/speech-training-recorder
The recorder works with Python 3.6.10. Need to pip install webrtcvad also.
The only tip I have is from a bit of amateur sound editing I did: collect many samples, and beware of big phrases: Like, ask her to say the same thing many times. And ... sometimes ... to ... stop ... at ... each ... word. And ... so ... me ... ti ... mes at each syllable.
Otherwise, if you ever need to create a sample that contains a single word/syllable, you can't. It is weird how often sound that contains clearly distinguishable syllables for the human ear still is not separable when you go to edit it.
Also, you might want to check wordlists by frequency to get a menu of common words, and IPA notation, to ensure you cover a good range of sounds.
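Here's a rough sketch of the coverage idea: given candidate sentences and a word-to-IPA lookup, greedily pick the sentences that add the most not-yet-covered sounds. The mini pronunciation dictionary and the sentences are invented for the example; in practice you'd load a real pronouncing dictionary and a frequency-sorted wordlist.

```python
def pick_prompts(prompts, pronunciations):
    """Greedy cover: choose prompts until every phoneme in the dictionary is hit.

    Returns (chosen prompts in pick order, phonemes no prompt could cover).
    """
    def phonemes_of(prompt):
        found = set()
        for word in prompt.lower().split():
            found.update(pronunciations.get(word.strip(".,!?"), ()))
        return found

    remaining = set().union(*pronunciations.values())
    candidates = {p: phonemes_of(p) for p in prompts}
    chosen = []
    while remaining and candidates:
        # Pick the prompt that covers the most still-missing phonemes.
        best = max(candidates, key=lambda p: len(candidates[p] & remaining))
        gained = candidates[best] & remaining
        if not gained:
            break
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


# Hypothetical mini dictionary (word -> IPA phonemes) and candidate sentences.
pronunciations = {
    "she": ["ʃ", "iː"],
    "sells": ["s", "ɛ", "l", "z"],
    "sea": ["s", "iː"],
    "shells": ["ʃ", "ɛ", "l", "z"],
    "the": ["ð", "ə"],
    "cat": ["k", "æ", "t"],
    "sat": ["s", "æ", "t"],
}
prompts = ["She sells sea shells.", "The cat sat.", "The sea."]

chosen, missing = pick_prompts(prompts, pronunciations)
print(chosen, missing)
```

Anything left in `missing` tells you which sounds still need a sentence written for them.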
Don’t know why you’re being downvoted. Thought it was insightful.
Good luck and best wishes! <3
Later, you can extract all the phonemes you want from it and you will retain the emotional expressiveness of her voice.
She should probably sing some songs -- lullabies, rock, etc. Go for emotional diversity.
Is this something that she wants? She's got a lot on her plate (emotionally and logistically) to prepare for this surgery, and maybe doesn't need a big geek project inflicted upon her just because there's a small chance of a bad outcome.
also it might just help pass the time since OP has 3 weeks.
Reach out to Andrew Mason.