StreamVC: Real-Time Low-Latency Voice Conversion (opens in new tab)

(research.google)

99 pointstrevett1y ago38 comments

38 comments

coldblues1y ago

https://github.com/hrnoh24/stream-vc

https://github.com/yuval-reshef/StreamVC

Unofficial implementations of StreamVC

android5211y ago

what is the hardware requirements

huac1y ago

The samples were released a while back: https://google-research.github.io/seanet/stream_vc/

modeless1y ago

Not a very good demo page. It's difficult to judge real world quality with such unenthusiastic reading, unrealistic sentences, and unfamiliar voices. Typical of speech papers. It would be much better if celebrities were used as target voices, as we all know what they sound like and can therefore judge quality better. But I suppose that would be too controversial for Google.

In general I think it is silly that voice cloning research has focused so much (exclusively?) on cloning voices from just a few seconds of audio. It puts a pretty low ceiling on quality. Many nuances of a person's communication style will not be contained in such a small amount of data. Sure you can match their pitch and timbre, but voice cloning should be more than that.

refulgentis1y ago

> But I suppose that would be too controversial for Google.

You don't have to suppose anything: it is actually settled law that its bad to just willy-nilly use people's voices if you feel like it, even if its just a sound-alike!

4 more replies

ascorbic1y ago

For those confused as I was - it's not trying to match the accent of the target speech in those samples, just the timbre. To quote the paper:

> Voice conversion refers to altering the style of a speech signal while preserving its linguistic content. While style encompasses many aspects of speech, such as emotion, prosody, accent, and whispering, in this work we focus on the conversion of speaker timbre only while keeping the linguistic and para-linguistic information unchanged.

judiisis1y ago

What is the current best Foss(or otherwise) implementation for voice changer/anonymiser?

coldblues1y ago

Last time I checked, it was https://github.com/w-okada/voice-changer

Requires a decent amount of VRAM and runs poorly with pretty bad quality (IMO)

ipnon1y ago

Once again we see evidence that AI-for-all is not bottlenecked by research but by the physical limitations of compute infrastructure.

2 more replies

udev40961y ago

Actual paper: https://arxiv.org/pdf/2401.03078

manishsharan1y ago

Are there any use cases that is driving this ? Is there a huge burning need for technology ?

Are kidnappers and con-men a huge under-served market that Google is hoping to serve ? Deep Fake videos not convincing enough to serve the need of fraudsters ?

I am totally against regulating AI but shit like this gives fodder to the other side.

Ukv1y ago

Voice anonymization is the use-case mentioned by the paper. If you're recording a video or communicating online, having this over your voice would benefit privacy by avoiding revealing your real voice that can be matched back to your face/name/job/etc. I think a lot of people are currently reluctant to use their voice at all online for privacy reasons, resorting to only text.

Also allows people uncomfortable with their natural voice, in particular transgender people, to communicate closer to how they wish to be perceived. Or even for someone to use their own natural voice from previous recordings if some temporary or chronic disease/disorder has impaired it.

There are probably a bunch of creative applications - like doing character voices for a D&D session or reading an audiobook. Obviously depends on the preferences of those involved, and many will currently dislike it on the basis of it being AI, but I think over time we'll see the tech integrated in interesting ways.

I imagine the majority of the use will be in entertainment/memes/satire - joining a call with an amusing voice on, or the equivalent of Snapchat's face filters. Not something critical that we couldn't do without, but still a fun application.

I don't see much benefit to kidnappers in this; if you just need to send an anonymous message without much concern about flow and latency, text or traditional TTS is fine.

TylerE1y ago

Since the quality is pretty listenable, one use case I can see is youtubers who want to do voiceovers on their videos, but not be linked to their real world identity.

Heck, I can even see broadcasting uses. Imagine if every on-air personality had good target files made ahead of time, so then when they catch a cold, production runs their lapel mic feed through this, using the "good" target sample, and remove all the congestion and raspiness.

pessimizer1y ago

You're totally against regulating AI, but the idea that AI could aid anonymity makes you want to regulate AI?

> I am totally against regulating AI but shit like this gives fodder to the other side.

You think anonymity is so universally hated that it's actually bad PR for leaving AI completely unregulated? No other problems with AI that you can think of, and also no good reason why someone should be allowed to be anonymous?

leobg1y ago

> applicable to real-time communication scenarios like calls and video conferencing, and addressing use cases such as voice anonymization in these scenarios.

It’s not a desire I ever had. But maybe people are different?

Alternatively, building the solution was so much fun that the question of whether this is a problem that should be solved was never asked.

ganeshkrishnan1y ago

I had couple of usecases for this. One was one of my very young cousin usually has voice chat in his gaming sessions and I wanted to anonymize it.

The second was we got a very enthusiastic video spokesperson but unfortunately she has a very thick non-american accent and this can help us alleviate it.

sahmeepee1y ago

This will not resolve your second issue as it replaces timbre but not accent.

gnat1y ago

From the poster:

In this work, we propose a light-weight (~20M param.) causal voice conversion solution that can run in real-time with low latency on a commercially available mobile device. The key design elements are: (1) using a causal encoder to learn soft speech units; (2) injecting whitened f0 to improve pitch stability without leaking source speaker info.

In our later V2 version, we found that f0 rescaling followed by a NSF-style harmonic-plus-noise conditioning (as is done in RVC) results in better quality.

1 more reply

froglus1y ago

is it like discord or just voice chat, because i like to have things twice!!

neilk1y ago

What are the anticipated use cases?

I know of one: transgender people often would like to alter the timbre of their voice and spend a lot of time training their voice. At least for online scenarios, this can just do it.

But other than that AI voice altering research seems like it benefits mostly scammers? I’m just wondering what they tell themselves they’re doing. I didn’t see this in the paper.

swatcoder1y ago

I think it's hard to see the use case right now because the quality remains pretty dreadful.

But the prototypical legitimate use case (which we needn't be excited about), is a voice over artist leasing their timbre instead of their time so that new text can be made to sound like them without their being actively involved. If it were to become mature (which doesn't seem close, from this example), it would be a big step up from existing phone tree voice assemblage and would open the doors for dubbing, animation voiceover, harmonization, and ADR in commercial sound and film.

Gender masking or general anonymization aren't really served by this, as you don't need to adopt a specific target timbre to deliver on those. There are other techniques that work perfectly well for those uses, some that have already been around for ages.

pgt1y ago

From the abstract: "making it applicable to real-time communication scenarios like calls and video conferencing, and addressing use cases such as voice anonymization in these scenarios."

thinkski1y ago

I suspect one is masking that a call center is in a low wage country, e.g. make customer in U.S. believe they’re talking to someone in U.S. while paying a fraction of the U.S. wage.

neilk1y ago

Right. I thought of that too, but it doesn’t mask accents, at least not yet

I suppose if you could make agents all sound the same they would be interchangeable, and companies always love that. It’s Anjali or Ligaya or Dolores but now they all sound like “Becky”?

Dracophoenix1y ago

Voiceover/broadcasting. Recording or acquiring any audio that isn't freely licensed background music is among the most expensive and time consuming parts of a prerecorded broadcast. With voice alteration, a director and sound engineer can become their own actors in anything ranging from commercial spots to large-scale and long-running animated shows.

ChrisMarshallNY1y ago

I suspect that fraudsters will love this tool.

I really believe that we are entering a "golden age" of fraud. It will be crazy.

viila1y ago

The first case you mention are scammers too really. They're trying to deceive others into believing they're something they're not, especially with this sort of voice manipulation.

numpad01y ago

so there are regions and societies where the former of those use cases is massively more common than the latter, and then there are also...

webappguy1y ago

RFK Jr.

neilk1y ago

You’re getting downvoted perhaps because people think you’re saying something political, but I think you mean “a stronger voice for people with physical issues producing speech”.

I have a friend who has a faint, scratchy voice because his throat is riddled with benign growths that a surgeon has to dig out of him every few years. Eventually he will probably lose his voice. Maybe?

j / k navigate · click thread line to collapse

38 comments

coldblues1y ago

https://github.com/hrnoh24/stream-vc

https://github.com/yuval-reshef/StreamVC

Unofficial implementations of StreamVC

android5211y ago

what is the hardware requirements

huac1y ago

The samples were released a while back: https://google-research.github.io/seanet/stream_vc/

modeless1y ago

refulgentis1y ago

> But I suppose that would be too controversial for Google.

You don't have to suppose anything: it is actually settled law that its bad to just willy-nilly use people's voices if you feel like it, even if its just a sound-alike!

4 more replies

ascorbic1y ago

For those confused as I was - it's not trying to match the accent of the target speech in those samples, just the timbre. To quote the paper:

judiisis1y ago

What is the current best Foss(or otherwise) implementation for voice changer/anonymiser?

coldblues1y ago

Last time I checked, it was https://github.com/w-okada/voice-changer

Requires a decent amount of VRAM and runs poorly with pretty bad quality (IMO)

ipnon1y ago

Once again we see evidence that AI-for-all is not bottlenecked by research but by the physical limitations of compute infrastructure.

2 more replies

udev40961y ago

Actual paper: https://arxiv.org/pdf/2401.03078

manishsharan1y ago

Are there any use cases that is driving this ? Is there a huge burning need for technology ?

Are kidnappers and con-men a huge under-served market that Google is hoping to serve ? Deep Fake videos not convincing enough to serve the need of fraudsters ?

I am totally against regulating AI but shit like this gives fodder to the other side.

Ukv1y ago

I don't see much benefit to kidnappers in this; if you just need to send an anonymous message without much concern about flow and latency, text or traditional TTS is fine.

TylerE1y ago

Since the quality is pretty listenable, one use case I can see is youtubers who want to do voiceovers on their videos, but not be linked to their real world identity.

pessimizer1y ago

You're totally against regulating AI, but the idea that AI could aid anonymity makes you want to regulate AI?

> I am totally against regulating AI but shit like this gives fodder to the other side.

leobg1y ago

> applicable to real-time communication scenarios like calls and video conferencing, and addressing use cases such as voice anonymization in these scenarios.

It’s not a desire I ever had. But maybe people are different?

Alternatively, building the solution was so much fun that the question of whether this is a problem that should be solved was never asked.

ganeshkrishnan1y ago

I had couple of usecases for this. One was one of my very young cousin usually has voice chat in his gaming sessions and I wanted to anonymize it.

The second was we got a very enthusiastic video spokesperson but unfortunately she has a very thick non-american accent and this can help us alleviate it.

sahmeepee1y ago

This will not resolve your second issue as it replaces timbre but not accent.

gnat1y ago

From the poster:

In our later V2 version, we found that f0 rescaling followed by a NSF-style harmonic-plus-noise conditioning (as is done in RVC) results in better quality.

1 more reply

froglus1y ago

is it like discord or just voice chat, because i like to have things twice!!

neilk1y ago

What are the anticipated use cases?

I know of one: transgender people often would like to alter the timbre of their voice and spend a lot of time training their voice. At least for online scenarios, this can just do it.

But other than that AI voice altering research seems like it benefits mostly scammers? I’m just wondering what they tell themselves they’re doing. I didn’t see this in the paper.

swatcoder1y ago

I think it's hard to see the use case right now because the quality remains pretty dreadful.

pgt1y ago

From the abstract: "making it applicable to real-time communication scenarios like calls and video conferencing, and addressing use cases such as voice anonymization in these scenarios."

thinkski1y ago

I suspect one is masking that a call center is in a low wage country, e.g. make customer in U.S. believe they’re talking to someone in U.S. while paying a fraction of the U.S. wage.

neilk1y ago

Right. I thought of that too, but it doesn’t mask accents, at least not yet

I suppose if you could make agents all sound the same they would be interchangeable, and companies always love that. It’s Anjali or Ligaya or Dolores but now they all sound like “Becky”?

Dracophoenix1y ago

ChrisMarshallNY1y ago

I suspect that fraudsters will love this tool.

I really believe that we are entering a "golden age" of fraud. It will be crazy.

viila1y ago

The first case you mention are scammers too really. They're trying to deceive others into believing they're something they're not, especially with this sort of voice manipulation.

numpad01y ago

so there are regions and societies where the former of those use cases is massively more common than the latter, and then there are also...

webappguy1y ago

RFK Jr.

neilk1y ago

You’re getting downvoted perhaps because people think you’re saying something political, but I think you mean “a stronger voice for people with physical issues producing speech”.

j / k navigate · click thread line to collapse