I shared a recording of audio I generated with that here: https://simonwillison.net/2026/Jan/22/qwen3-tts/
Far more terrifying is Big Tech having access to a closed version of the same models, in the hands of powerful people with a history of unethical behavior (e.g. Zuckerberg's "Dumb Fucks" comments). In fact it's a miracle, and a bit ironic, that the Chinese would be the ones to release a plethora of capable open source models, while we've only seen scraps from Google, Meta, OpenAI, etc.
And most of all: they're both local models. The cat is out of the bag and it's never going back in. There's no censoring this. No company can pull the plug. Anyone with a semi-modern GPU can run these models.
There are far more good and interesting use cases for this technology. Games will let users clone their voices and create virtual avatars and heroes. People will have access to creative tools that let them make movies and shows with their likeness. People who can't sing will make music.
Nothing was scarier than the invention of the nuclear weapon. And we're all still here.
Life will go on. And there will be incredible benefits that come out of this.
I presume this is due to using the base model, and not the one tuned for more expressiveness.
edit: Or more likely, the demo not exposing the expressiveness controls.
The 1.7B model was much better at ignoring slight background noise in the reference audio compared to the 0.6B model though. The 0.6B would inject some of that into the generated audio, whereas the 1.7B model would not.
Also, without FlashAttention it was dog slow on my 5090, running at 0.3X realtime with just 30% GPU usage. Though I guess that's to be expected. No significant difference in generation speed between the two models.
Overall though, I'm quite impressed. I haven't checked out all the recent TTS models, but I've tried a fair number, and this is certainly one of the better ones I've heard in terms of voice cloning quality.
```
Loaded speech tokenizer from ~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e/speech_tokenizer
Fetching 11 files: 100%|| 11/11 [00:00<00:00, 125033.45it/s]
The tokenizer you are loading from '~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instr.... This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
```
What am I doing wrong?
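For what it's worth, the warning spells out its own fix. A minimal sketch, assuming `fix_mistral_regex` is forwarded as a kwarg through the standard transformers `AutoTokenizer.from_pretrained` (I haven't verified how the Qwen loading code passes it along):

```python
from transformers import AutoTokenizer

# Assumption: the flag named in the warning is forwarded through
# from_pretrained to the tokenizer; the repo id is the one in the log.
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    fix_mistral_regex=True,
)
```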
That's not really rational though, considering the internet is already full of examples of my voice that anyone could use. Here's a recent podcast clip: https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3006s
Here's the script I'm using: https://github.com/simonw/tools/blob/main/python/q3_tts.py
You can try it with uv (downloads a 4.5GB model on first run) like this:
uv run https://tools.simonwillison.net/python/q3_tts.py \
'I am a pirate, give me your gold!' \
-i 'gruff voice' -o pirate.wav

Hopefully I can make this work on Windows (or Linux, I guess).
thanks so much.
mlx-audio only works on Apple Silicon
uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
python -m mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 --text "Hello, this is a test." --ref_audio path_to_audio.wav --ref_text "Transcript of the reference audio." --play

You'd need to use a different build of the model though; I don't think MLX has a CPU implementation.
Anyhow, with faster CPUs and optimizations, you won't be waiting too long. Also, 20GB is overkill for an audio model. Only text LLMs are huge and eat seemingly infinite memory. SD/FLUX models come in under 16GB of RAM usage (uh, mine do, at least!), for instance.
Using speaker Ryan seems to be the most consistent. I tried speaker Eric and it sounded like someone putting on a fake, exaggerated Chinese accent in mockery.
If it wasn't for the unpredictable level of emotions from each chunk, I'd say this is easily the highest quality TTS model I've tried.
> Read this in a calm, clear, and wise audiobook tone.
> Do not rush. Allow the meaning to sink in.
But maybe I should experiment with something more detailed. Do you have any suggestions?
The Tao Te Ching audiobook came in at 62 minutes of audio and took 102 minutes to generate, which gives an RTF of 102/62 ≈ 1.645.
I do get a warning about flash-attn not being installed, which says it'll slow down inference. I don't think FlashAttention supports a GPU as old as the 1080 (it targets newer architectures), and I wasn't up for tinkering to find out.
Now, maybe the results were cherry-picked. I know everyone else who has released one of these cherry-picks which samples to publish. However, this is the first time I've considered it plausible to use AI TTS to remaster old radio plays and the like, where a section of audio is unintelligible but can be deduced from context, like a tape glitch where someone says "HEY [...]LAR!" and it's an episode of Yours Truly, Johnny Dollar...
I have dozens of hours of audio of actors like Bob Bailey and others of that era.
Besides, they know which side their bread is buttered on. I feel like this is almost not the real announcement; or, the engineers who wrote this up and did the demos just ran it that way. The normal speech voices are fine (they're lower on the page than the anime ones). I agree that the first few are very infantile. I'll change that word if I can think of a better one.
Observe, original: https://www.youtube.com/watch?v=YiRcOVDAryM
My edit (took about an hour to set up, if memory serves; forgot the render time...): https://www.youtube.com/watch?v=xazubVJ0jz4
i say "was [...] software" because the last 2 times i've tried to use it, it did imperceptible cleanup, making it worthless. Anyhow, all my radio plays are from OTRR, i think.Audio.Restoration.DeNoise.DeNoiseLF.2.8.3_WiN.OSX is a more recent version i think
P.S. Are you a "dude named Ben"?
Although I like the model, I don't like the leadership of that company, how closed it is, and how divisive they are in terms of politics.
Have you tested alternatives? I grabbed Open Code and a MiniMax M2.1 subscription, even just the $10/mo one, to test with.
Result? We designed, from scratch, a spec for a slight variation of a tool I had previously spec'd out with Claude: the same problem (a process supervisor tool).
Honestly, it worked great. I've since played a little further with generating code (this time Go), and again, I'm happy.
Beyond that, GLM 4.7 should also be great.
See https://dev.to/kilocode/open-weight-models-are-getting-serio...
It's a recent case study of vibe-coding a smaller tool with Kilo Code, comparing output from MiniMax M2.1 and GLM 4.7.
Honestly, just give it a whirl - no need to send money to companies/nations you disagree with.
What do you mean by this?
https://www.bloomberg.com/news/articles/2026-01-20/anthropic...
I still have a small Claude account to do some code reviews. Opus 4.5 does good reviews, but at this point GLM 4.7 can usually do the same code reviews.
If cost is an issue (for me it is; I pay out of pocket), go with GLM 4.7.
Regardless of how productive those numbers may seem, that amount of code being published so quickly is concerning, to say the least. It couldn't have possibly been reviewed by a human or properly tested.
If this is the future of software development, society is cooked.
I spent 20 minutes yesterday trying to get GLM 4.7 to understand that a simple modal on a web page (vanilla JS and HTML!) wasn't displaying when a certain button was clicked. I hooked it up to Chrome MCP in Open Code as well.
It constantly told me that it fixed the problem. In frustration, I opened Claude Code and just typed "Why won't the button with ID 'edit' work???!"
It fixed the problem in one shot. This isn't even a hard problem (and I could have just fixed it myself but I guess sunk cost fallacy).
I use Opus 4.5 for planning; when I reach my usage limits I fall back to GLM 4.7 just for implementing the plan. It still struggles, even though I configure GLM 4.7 as both the smaller model and the heavier model in Claude Code.
China would need an architectural breakthrough to leapfrog American labs, given the huge compute disparity.
1. Chinese researcher in China, to be more specific.
Because I've been on YouTube and Insta, and believe me, no one else even compares, yet.
https://huggingface.co/mlx-community/Qwen3-TTS-12Hz-0.6B-Bas...
uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
python -m mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 --text "Hello, this is a test." --ref_audio path_to_audio.wav --ref_text "Transcript of the reference audio." --play

[1] https://github.com/kyutai-labs/pocket-tts
[2] https://github.com/pchalasani/claude-code-tools?tab=readme-o...
I have a friend with a paralysed larynx who is often using his phone or a small laptop to type in order to communicate. I know he would love it if it was possible to take old recordings of him speaking and use that to give him back "his" voice, at least in some small measure.
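For what it's worth, the voice-clone flags quoted elsewhere in this thread look like exactly that workflow. A sketch, assuming an old recording of him plus a transcript of it (old_recording.wav and the transcript text are hypothetical placeholders), reusing the mlx-audio command from above:

```
python -m mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 \
  --text "Whatever he types to say next." \
  --ref_audio old_recording.wav \
  --ref_text "A transcript of what he says in the old recording." \
  --play
```

(mlx-audio is Apple Silicon only, per the note elsewhere in the thread.)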
There are some samples. If you have a GPU you might want to fork and improve this; otherwise it's slow, but still usable on CPU.
Parakeet is pretty good, but there are times it struggles. Would be interesting to see how Qwen compares once Handy has it in.
1: https://old.reddit.com/r/ZenlessZoneZero/comments/1gqmtl1/th...
100% I was thinking the same thing.
And if you ask me, I think these models were trained on tween fiction podcasts. (My kids listen to a lot of these and dramatic over-acting seems to be the industry standard.)
Also, their middle-aged adult with an "American English" accent doesn't sound like any American I've ever met. More like a bad Sean Connery impersonator.
This is needed for processing an indie game's voice recordings, where the voice actors weren't native speakers and had an accent.
Edit: "Cross-lingual Voice Clone" https://qwen.ai/blog?id=qwen3tts-0115#voice-clone