https://phys.org/news/2010-12-air-playstation-3s-supercomput...
I have yet to see any benefit to society from GPT's improvements, but I do see the internet quickly becoming more and more unusable due to the inundation of machine-generated spam on nearly every communications platform.
A) silently dispose of it and hope nobody else ever makes the mistake of creating it.
or
B) keep it in the freezer and hold a press release about how you made it but it's too dangerous to share any details.
The model in question is Microsoft's VALL-E 2, without the clickbait headline.
> Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases
https://www.microsoft.com/en-us/research/project/vall-e-x/va...
> This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public.
If you go back and look at older cities they almost all have the same pattern: walls and gates.
I figure that now that the Internet is a badlands roamed by robots pretending to be people as they attempt to rob you for their masters, we'll see the formation of cryptographically secured enclaves. Maybe? Who knows?
At this point I'm pretty much going to restrict online communication to encrypted authenticated channels. (Heck, I should sign this comment, eh? If only as a performance?) Hopefully it remains difficult to build an AI that can guess large numbers. ;P
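To make "sign this comment" concrete, here's a minimal sketch of authenticating a message, using Python's standard library. This uses an HMAC tag as a simplified stand-in for a real public-key signature (a real scheme would use something like Ed25519 so anyone can verify without holding the secret key); the key and message here are purely illustrative.

```python
import hashlib
import hmac
import secrets

# Illustrative only: a secret key held by the comment's author.
key = secrets.token_bytes(32)
comment = b"Hopefully it remains difficult to build an AI that can guess large numbers."

# Tag the comment so tampering is detectable by anyone holding the key.
tag = hmac.new(key, comment, hashlib.sha256).hexdigest()

def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)

print(verify(key, comment, tag))          # True: genuine comment
print(verify(key, b"tampered text", tag)) # False: forged comment
```

The point isn't this particular primitive; it's that once bots flood every channel, cheap-to-verify proofs of authorship are the walls and gates of the analogy above.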
> so 2024 will be the last human election, and what we mean by that is not that it's just going to be an AI running as president in 2028, but that (although maybe, um, it will be, you know, humans as figureheads) it'll be whoever has the greater compute power that will win
We've already seen AI voices influencing elections in India: https://restofworld.org/2023/ai-voice-modi-singing-politics/
> AI-generated songs, like the ones featuring Prime Minister Narendra Modi, are gaining traction ahead of India’s upcoming elections. [...] Earlier this month, an Instagram video of Modi “singing” a Telugu love song had over 2 million views, while a similar Tamil-language song had more than 2.7 million. A Punjabi song racked up more than 17 million views.
1. “… but if you and your big rich company were to acquihire us you'd get access…” — though as this is MS it probably isn't that!
2. That it only works as well as claimed in specific circumstances, or has significant flaws, so they don't want people looking at it too closely just yet. The wording "in benchmarks used by Microsoft" might point to this.
3. That a competitor is getting close to releasing something similar, or has just done so, and they don't want to look like they were in second place or too far behind.
Social solutions take too long to deploy against the tech, but tech solutions are fallible. To be defeatist about it: there's going to be a golden window of time here where some really nasty scams go unimpeded.
It's almost as if a consumer protection agency should be created and funded to protect consumers.
'Hi, sorry to call you, I'm Cindy and I'm from your insurance. I'm calling regarding your car crash ...'.
Do you think you could give a recording of a minute of someone talking to a talented impressionist and they could impersonate that person to some degree? It doesn’t seem that far fetched to me.
Getting high-quality audio of an arbitrary private citizen via public means isn't that easy, especially for folks like me who don't post video on public social media, use automated call screening, and never say a word until the caller has been vetted.
> "This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public."
A speech generator can help rob 1000 banks.
Truly irresponsible
I use TTS to listen to articles and stories that don't have access to an audiobook narrator. I've used some of the voices based on MBROLA tech, but those can grate after a while.
The more recent voice models are a lot higher quality and emotive (without the jarring pitch transitions of things like Cepstral) so are better to listen to. However, the recent models can clip/skip text, have prolonged silence, have long/warped/garbled words, etc. that make them harder to use longer term.
It's like we're stuck in some movie that came out in 1994[0], or something. Except, in this version, everything is gonna blow up sooner or later, anyway. Might as well profit from it along the way, right?
Le sigh.
---
I can't even think of non-malicious uses that are anything more than novelties or small conveniences. Meanwhile, the malicious use cases are innumerable.
In a just world building this would be a severe felony, punished with prison and destruction of all of the direct and indirect source material.
On the one hand, I would love this kind of tech to be available for entertainment purposes. An RPG with convincing NPCs that are able to provide a novel experience for every player? Sounds great.
On the other: this is fraught with ethical problems, not to mention an ideal tool for fraud. At worst, it could be used as a weapon for total asymmetric warfare on concepts like media integrity, and as an ideal tool for character assassination, disinformation, propaganda, etc.
I would happily welcome a world where this stuff is nerfed across the board, where video games and porn are just chock full of AI voice-acting artifacts. We'll adjust and accept that as just a part of the experience, as we have with low-fidelity media of the past. But my more cynical side tells me that's not what people in power are concerned about.
Perfecting the tech for widespread use has trade-offs: the need for caller ID, the ease of slander until trust in voice uniqueness recalibrates. All of that is going to change soon anyway, but giving only rich/bad actors the tech at first has its own set of trade-offs. Head-in-the-sand is the irresponsible way.