I shared a recording of audio I generated with that here: https://simonwillison.net/2026/Jan/22/qwen3-tts/
Far more terrifying is Big Tech having access to a closed version of the same models, in the hands of powerful people with a history of unethical behavior (e.g. Zuckerberg's "Dumb Fucks" comments). In fact it's a miracle, and a bit ironic, that the Chinese would be the ones to release a plethora of capable open source models, while we've only seen scraps from Google, Meta, OpenAI, etc.
And most of all: they're both local models. The cat is out of the bag and it's never going back in. There's no censoring this. No company can pull the plug. Anyone with a semi-modern GPU can run these models.
There are far more good and interesting use cases for this technology. Games will let users clone their voices and create virtual avatars and heroes. People will have access to creative tools that let them make movies and shows with their likeness. People who can't sing will make music.
Nothing was scarier than the invention of the nuclear weapon. And we're all still here.
Life will go on. And there will be incredible benefits that come out of this.
I presume this is due to using the base model, and not the one tuned for more expressiveness.
edit: Or more likely, the demo not exposing the expressiveness controls.
The 1.7B model was much better at ignoring slight background noise in the reference audio compared to the 0.6B model though. The 0.6B would inject some of that into the generated audio, whereas the 1.7B model would not.
Also, without FlashAttention it was dog slow on my 5090, running at 0.3X realtime with just 30% GPU usage. Though I guess that's to be expected. No significant difference in generation speed between the two models.
Overall though, I'm quite impressed. I haven't checked out all the recent TTS models, but I've tried a fair number, and this is certainly one of the better ones I've heard in terms of voice cloning quality.
```
Loaded speech tokenizer from ~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e/speech_tokenizer
Fetching 11 files: 100%|| 11/11 [00:00<00:00, 125033.45it/s]
The tokenizer you are loading from '~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instr.... This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
```
What am I doing wrong?
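For what it's worth, the warning spells out its own fix. A minimal sketch, assuming `fix_mistral_regex` is forwarded as a kwarg through the standard transformers `AutoTokenizer.from_pretrained` (I haven't verified how the Qwen loading code passes it along):

```python
from transformers import AutoTokenizer

# Assumption: the flag named in the warning is forwarded through
# from_pretrained to the tokenizer; the repo id is the one in the log.
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    fix_mistral_regex=True,
)
```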
That's not really rational though, considering the internet is already full of examples of my voice that anyone could use. Here's a recent podcast clip: https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3006s
Here's the script I'm using: https://github.com/simonw/tools/blob/main/python/q3_tts.py
You can try it with uv (downloads a 4.5GB model on first run) like this:
uv run https://tools.simonwillison.net/python/q3_tts.py \
'I am a pirate, give me your gold!' \
-i 'gruff voice' -o pirate.wav

Hopefully I can make this work on Windows (or Linux, I guess).
thanks so much.
mlx-audio only works on Apple Silicon
uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
python -m mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 --text "Hello, this is a test." --ref_audio path_to_audio.wav --ref_text "Transcript of the reference audio." --play

You'd need to use a different build of the model though; I don't think MLX has a CPU implementation.
Anyhow, with faster CPUs and optimizations, you won't be waiting too long. Also, 20GB is overkill for an audio model. Only text LLMs are huge and eat seemingly infinite memory. SD/FLUX models come in under 16GB of RAM usage (uh, mine do, at least!), for instance.
Using speaker Ryan seems to be the most consistent. I tried speaker Eric and it sounded like someone putting on a fake, exaggerated Chinese accent in mockery.
If it wasn't for the unpredictable level of emotions from each chunk, I'd say this is easily the highest quality TTS model I've tried.
> Read this in a calm, clear, and wise audiobook tone.
> Do not rush. Allow the meaning to sink in.
But maybe I should experiment with something more detailed. Do you have any suggestions?
The Tao Te Ching audiobook came in at 62 minutes of audio and took 102 minutes to generate, which gives an RTF of 102/62 ≈ 1.645.
I do get a warning about flash-attn not being installed, which says it'll slow down inference. I don't think FlashAttention supports a GPU as old as the 1080 (it targets newer architectures), and I wasn't up for tinkering to find out.
Now, maybe the results were cherry-picked. I know everyone else who has released one of these cherry-picks which samples to publish. However, this is the first time I've considered it plausible to use AI TTS to remaster old radio plays and the like, where a section of audio is unintelligible but can be deduced from context, like a tape glitch where someone says "HEY [...]LAR!" and it's an episode of Yours Truly, Johnny Dollar...
I have dozens of hours of audio of actors like Bob Bailey and others of that era.
Besides, they know which side their bread is buttered on. I feel like this is almost not the real announcement; or, the engineers who wrote this up and did the demos just ran it that way. The normal speech voices are fine (they're lower on the page than the anime ones). I agree that the first few are very infantile. I'll change that word if I can think of a better one.
Observe, original: https://www.youtube.com/watch?v=YiRcOVDAryM
My edit (took about an hour to set up, if memory serves; forgot the render time...): https://www.youtube.com/watch?v=xazubVJ0jz4
i say "was [...] software" because the last 2 times i've tried to use it, it did imperceptible cleanup, making it worthless. Anyhow, all my radio plays are from OTRR, i think.Audio.Restoration.DeNoise.DeNoiseLF.2.8.3_WiN.OSX is a more recent version i think
P.S. Are you a "dude named Ben"?
Although I like the model, I don't like the leadership of that company, how closed it is, and how divisive they are in terms of politics.
Have you tested alternatives? I grabbed Open Code and a MiniMax M2.1 subscription, even just the $10/mo one, to test with.
Result? We designed, from scratch, a spec for a slight variation of a tool I had previously spec'd out with Claude: the same problem (a process supervisor tool).
Honestly, it worked great. I've since played a little further with generating code (this time Go), and again, I'm happy.
Beyond that, GLM 4.7 should also be great.
See https://dev.to/kilocode/open-weight-models-are-getting-serio...
It's a recent case study of vibe-coding a smaller tool with Kilo Code, comparing output from MiniMax M2.1 and GLM 4.7.
Honestly, just give it a whirl - no need to send money to companies/nations you disagree with.
What do you mean by this?
https://www.bloomberg.com/news/articles/2026-01-20/anthropic...
I still have a small Claude account to do some code reviews. Opus 4.5 does good reviews, but at this point GLM 4.7 can usually do the same code reviews.
If cost is an issue (for me it is; I pay out of pocket), go with GLM 4.7.
Regardless of how productive those numbers may seem, that amount of code being published so quickly is concerning, to say the least. It couldn't have possibly been reviewed by a human or properly tested.
If this is the future of software development, society is cooked.
I spent 20 minutes yesterday trying to get GLM 4.7 to understand that a simple modal on a web page (vanilla JS and HTML!) wasn't displaying when a certain button was clicked. I hooked it up to Chrome MCP in Open Code as well.
It constantly told me that it fixed the problem. In frustration, I opened Claude Code and just typed "Why won't the button with ID 'edit' work???!"
It fixed the problem in one shot. This isn't even a hard problem (and I could have just fixed it myself but I guess sunk cost fallacy).
I use Opus 4.5 for planning; when I reach my usage limits I fall back to GLM 4.7 just for implementing the plan. It still struggles, even though I configure GLM 4.7 as both the smaller model and the heavier model in Claude Code.
China would need an architectural breakthrough to leapfrog American labs, given the huge compute disparity.
1. Chinese researcher in China, to be more specific.
Because I've been on YouTube and Insta, and believe me, no one else even compares, yet.
https://huggingface.co/mlx-community/Qwen3-TTS-12Hz-0.6B-Bas...
uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
python -m mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 --text "Hello, this is a test." --ref_audio path_to_audio.wav --ref_text "Transcript of the reference audio." --play

[1] https://github.com/kyutai-labs/pocket-tts
[2] https://github.com/pchalasani/claude-code-tools?tab=readme-o...
I have a friend with a paralysed larynx who is often using his phone or a small laptop to type in order to communicate. I know he would love it if it was possible to take old recordings of him speaking and use that to give him back "his" voice, at least in some small measure.
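For what it's worth, the voice-clone flags quoted elsewhere in this thread look like exactly that workflow. A sketch, assuming an old recording of him plus a transcript of it (old_recording.wav and the transcript text are hypothetical placeholders), reusing the mlx-audio command from above:

```
python -m mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 \
  --text "Whatever he types to say next." \
  --ref_audio old_recording.wav \
  --ref_text "A transcript of what he says in the old recording." \
  --play
```

(mlx-audio is Apple Silicon only, per the note elsewhere in the thread.)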
There are some samples. If you have a GPU you might want to fork and improve this; otherwise it's slow, but still usable on CPU.
Parakeet is pretty good, but there are times it struggles. Would be interesting to see how Qwen compares once Handy has it in.
1: https://old.reddit.com/r/ZenlessZoneZero/comments/1gqmtl1/th...
100% I was thinking the same thing.
And if you ask me, I think these models were trained on tween fiction podcasts. (My kids listen to a lot of these and dramatic over-acting seems to be the industry standard.)
Also, their middle-aged adult with an "American English" accent doesn't sound like any American I've ever met. More like a bad Sean Connery impersonator.
This is needed for processing an indie game's voice recordings, where the voice actors weren't native speakers and had an accent.
Edit: "Cross-lingual Voice Clone" https://qwen.ai/blog?id=qwen3tts-0115#voice-clone