Qwen3-VL (opens in new tab)

(qwen.ai)

434 pointsnatrys8mo ago160 comments

160 comments

As I mentioned yesterday - I recently needed to process hundreds of low quality images of invoices (for a construction project). I had a script that had used pil/opencv, pytesseract, and open ai as a fallback. It still has a staggering number of failures.

Today I tried a handful of the really poor quality invoices and Qwen spat out all the information I needed without an issue. What's crazier is it gave me the bounding boxes to improve tesseract.

iamflimflam18mo ago

I would recommend taking a look at this service: https://learn.microsoft.com/en-us/rest/api/computervision/re...

iamleppert8mo ago

Microsoft Vision is so expensive and has a ridiculous rate limit, is slow, and isn't any better than what you can run yourself. You have to make every request over HTTP (with a rate limit), and there is no ability to do bulk jobs. It's also incredibly expensive.

benterix8mo ago

I wonder why you chose Qwen specifically - Mistral has a specialized model just for OCR that they advertised heavily (I tested it and it works surprisingly well, at least on English-language books from 80s and 90s).

z28mo ago

Mistral's model was terrible when I tested it on non Latin characters and on anything that isn't neat printed text (i.e. handwriting)

VladVladikoff8mo ago

Interesting. I have in the past tried to get bounding boxes of property boundaries on satellite maps estimated by VLLM models but had no success. Do you have any tips on how to improve the results?

richardlblair8mo ago

With Qwen I went as stupid as I could: please provide the bounding box metadata for pytesseract for the above image.

And it spat it out.

VladVladikoff8mo ago

It’s funny that many of us say please. I don’t think it impacts the output, but it also feels wrong without it sometimes.

3 more replies

Workaccount28mo ago

Gemini has purpose post training for bounding boxes if you haven't tried it.

The latest update on Gemini live does real time bounding boxes on objects it's talking about, it's pretty neat.

rsalama28mo ago

shameless plug here for AMD's AI Dev Day - registration is open and they want feedback on what to focus on: https://www.amd.com/en/corporate/events/amd-ai-dev-day.html

mh-8mo ago

Do you have some example images and the prompt you tried?

BOOSTERHIDROGEN8mo ago

also documented stack setup if could.

wiz21c8mo ago

I like to test these models on reading the contents of 80's Apple ][ games screenshots. These are very low resolution, very dense. All (free to use) models struggle on that task...

richardlblair8mo ago

My dataset could be described in a similar way. Very low quality, very odd layouts, information density where it's completely unnecessary.

And these contractors were relatively good operators compared to most.

netdur8mo ago

I’ve tried that too, trying to detect the scan layout to get better OCR, but it didn’t really beat a fine-tuned Qwen 2.5 VLM 7B. I’d say fine-tuning is the way to go

richardlblair8mo ago

What's the cost of the fine-tuned model? If you were attempting to optimize for cost, would it be worth it to detect scan layouts to get better OCR?

Honestly, I'm such a noob in this space. I had 1 project I needed to do, didn't want to do it by hand which would have taken 2 days so I spent 5 trying to get a script to do it for me.

netdur8mo ago

the model runs on H200 in ~20s, costing about $2.4/hr. on L4 it’s cheaper at ~$0.3/hr but takes ~85s to finish. overall, H200 ends up cheaper at volume. my scan has a separate issue though: each page has two columns, so text from the right side sometimes overflows into the left. OCR can’t really tell where sentences start and end unless the layout is split by column.

rexreed8mo ago

what fine tuning approach did you use?

netdur8mo ago

just unsloth on colab using A100 and dataset on google drive.

unixhero8mo ago

So where did you load up Qwen and how did you supply the pdf or photo files? I don't know how to use these models, but want to learn

baby_souffle8mo ago

LM Studio[0] is the best "i'm new here and what is this!?" tool for dipping your toes in the water.

If the model supports "vision" or "sound", that tool makes it relatively painless to take your input file + text and feed it to the model.

[0]: https://lmstudio.ai/

dabockster8mo ago

Jumping from this for visibility - LM Studio really is the best option out there. Ollama is another runtime that I've used, but I've found it makes too many assumptions about what a computer is capable of and it's almost impossible to override those settings. It often overloads weaker computers and underestimates stronger ones.

LM Studio isn't as "set it and forget it" as Ollama is, and it does have a bit of a learning curve. But if you're doing any kind of AI development and you don't want to mess around with writing llama-cpp scripts all the time, it really can't be beat (for now).

unixhero8mo ago

Thank you! I will give it a try and see if I can get that 4090 working a bit.

Alifatisk8mo ago

You can use their models here chat.qwenlm.ai, its their official website

dabockster8mo ago

I wouldn't recommend using anything that can transmit data back to the CCP. The model itself is fine since it's open source (and you can run it firewalled if you're really paranoid), but directly using Alibaba's AI chat website should be discouraged.

captainregex8mo ago

AnythingLLM also good for that GUI experience!

captainregex8mo ago

I should add that sometimes LM Studio just feels better for the use case, same model same purpose seemingly different output usually when involving RAG, but Anything is definitely a very intuitive visual experience

lofaszvanitt8mo ago

People actually use tesseract? It's one of the worst OCR solutions out there. Forget it.

creativebee8mo ago

Any tipps on getting bounding boxes? The model doesn’t seem to even understand the original size of the image. And even if I provide the dimensions, the positioning is off. :'(

kardianos8mo ago

Wait a moment... It gave you BOUNDING BOXES? That is awesome! That is a missing link I need for models.

pouetpouetpoue8mo ago

i had success with tabula. you may not need ai. but fine if it works too.

re5i5tor8mo ago

I would strongly emphasize:

CV != AI Vision

gpt-4o would breeze through your poor images.

richardlblair8mo ago

It did not, unfortunately. When CV failed gpt-4o failed as well. I even had a list of valid invoice numbers & dates to help the models. Still, most failed.

Construction invoices are not great.

re5i5tor8mo ago

Did you try few-shotting examples when you hit problem cases? In my ziploc case, the model was failing if red sharpie was used vs black. A few shot hint fixed that.

1 more reply

re5i5tor8mo ago

I’m very surprised. Have dealt with some really ugly inputs (handwritten text on full ziploc bags etc., stained torn handwritten recipe cards, etc.) with super good success.

deepdarkforest8mo ago

The Chinese are doing what they have been doing to the manufacturing industry as well. Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive. These models might be bechmaxxed but as another comment said, i see so many that it might as well be the most impressive benchmaxxing today, if not just a genuinely SOTA open source model. They even released a closed source 1 trillion parameter model today as well that is sitting on no3(!) on lm arena. EVen their 80gb model is 17th, gpt-oss 120b is 52nd https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2...

jychang8mo ago

They still suck at explaining which model they serve is which, though.

They also released today Qwen3-VL Plus [1] today alongside Qwen3-VL 235B [2] and they don't tell us which one is better. Note that Qwen3-VL-Plus is a very different model compared to Qwen-VL-Plus.

Also, qwen-plus-2025-09-11 [3] vs qwen3-235b-a22b-instruct-2507 [4]. What's the difference? Which one is better? Who knows.

You know it's bad when OpenAI has a more clear naming scheme.

[1] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

[2] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

[3] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

[4] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

jwr8mo ago

> They still suck at explaining which model they serve is which, though.

"they" in this sentence probably applies to all "AI" companies.

Even the naming/versioning of OpenAI models is ridiculous, and then you can never find out which is actually better for your needs. Every AI company writes several paragraphs of fluffy text with lots of hand waving, saying how this model is better for complex tasks while this other one is better for difficult tasks.

viraptor8mo ago

Both Deepseek and Claude are exceptions. Simple versions and Sonnet is overall worse but faster than Opus for the same version.

deepdarkforest8mo ago

Eh i mean often innovation is made just by letting a lot of fragmented, small teams of cracked nerds trying out stuff. It's way too early in the game. I mean, qwens release statements have anime etc. IBM, Bell, Google, Dell, many did it similarly, letting small focused teams having many attempts at cracking the same problem. All modern quant firms are doing basically the same as well. Anthropic is actually an exception, more like Apple.

marci8mo ago

it's sometimes not really a matter of which one is better but which one fits best.

For example many have switched to qwen3 models but some still vastly prefer the reasoning and output of QwQ (a qwen2.5 model).

And the difference between them: those with "plus" are closed weight, you can only access them through their api. The others are open-weight, so if they fit your use case, and if ever the want or need arise, you can download them, use them, even fine-tune them locally, even if qwen don't offer access to them any more.

jychang8mo ago

If the naming is so clear to you, then why don't you explain: for a user who wants to use Qwen3-VL through an API, which one has better performance? Qwen3-VL Plus or Qwen3-VL 235b?

1 more reply

nl8mo ago

> Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive.

This "just" is incorrect.

The Qwen team invented things like DeepStack https://arxiv.org/abs/2406.04334

(Also I hate this "The Chinese" thing. Do we say "The British" if it came from a DeepMind team in the UK? Or what if there are Chinese born US citizens working in Paris for Mistral?

Give credit to the Qwen team rather than a whole country. China has both great labs and mediocre labs, just like the rest of the world.)

viraptor8mo ago

The naming makes some sense here. It's backed by the very Chinese Alibaba and the government directly as well. It's almost a national project.

taneq8mo ago

The Americans do that all the time. :P

Mashimo8mo ago

> Do we say "The British"

Yes.

mamami8mo ago

Yeah it's just weird Orientalism all over again

riku_iki8mo ago

> Also I hate this "The Chinese" thing

to me it was positive assessment, I adore their craftsmanship and persistence in moving forward for long period of time.

mrtesthah8mo ago

It erases the individuals doing the actual research by viewing Chinese people as a monolith.

spaceman_20208mo ago

Interestingly, I've found that models like Kimi K2 spit out more organic, natural-sounding text than American models

Fails on the benchmarks compared to other SOTA models but the real-world experience is different

dabockster8mo ago

> Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency.

This is what really grinds my gears about American AI and American technology in general lately, as an American myself. We used to do that! But over the last 10-15 years, it seems like all this country can do is try to throw more and more resources at something instead of optimizing what we already have.

Download more ram for this progressive web app.

Buy a Threadripper CPU to run this game that looks worse than the ones you played on the Nintendo Gamecube in the early 2000s.

Generate more electricity (hello Elon Musk).

Y'all remember your algorithms classes from college, right? Why not apply that here? Because China is doing just that, and frankly making us look stupid by comparison.

helloericsf8mo ago

If you're in SF, you don't want to miss this. The Qwen team is making their first public appearance in the United States, with the VP of Qwen Lab speaking at the meetup below during SF teach week. https://partiful.com/e/P7E418jd6Ti6hA40H6Qm Rare opportunity to directly engage with the Qwen team members.

alfiedotwtf8mo ago

Let’s hope they’re allowed in the country and get a visa… it’s 50/50 these days

dazzaji8mo ago

Registration full :-(

be7a8mo ago

The biggest takeaway is that they claim SOTA for multi-modal stuff even ahead of proprietary models and still released it as open-weights. My first tests suggest this might actually be true, will continue testing. Wow

ACCount378mo ago

Most multi-modal input implementations suck, and a lot of them suck big time.

Doesn't seem to be far ahead of existing proprietary implementations. But it's still good that someone's willing to push that far and release the results. Getting multimodal input to work even this well is not at all easy.

Computer08mo ago

I feel like most Open Source releases regardless of size claim to be similar in output quality to SOTA closed source stuff.

Workaccount28mo ago

Sadly it still fails the "extra limb" test.

I have a few images of animals with an extra limb photoshopped onto them. A dog with an leg coming out of it's stomach, or a cat with two front right legs.

Like every other model I have tested, it insists that the animals have their anatomically correct amount of limbs. Even pointing out there is a leg coming from the dogs stomach, it will push back and insist I am confused. Insist it counted again and there are definitely only 4. Qwen took it a step further and even after I told it the image was edited, it told me it wasn't and there were only 4 limbs.

Jackson__8mo ago

It fails on any edge case, like all other VLMs. The last time a vision model succeeded at reading analog clocks, a notoriously difficult task, it was revealed they trained on nearly 1 million artificial clock images[0] to make it work. In a similar vein, I have encountered no model that could read for example a D20 correctly.[1]

It could probably identify extra limbs in your pictures if you too made a million example images to train it on, but until then it will keep failing. And of course you'll get to keep making millions more example images for every other issue you run into.

[0] https://huggingface.co/datasets/allenai/pixmo-clocks

[1] https://files.catbox.moe/ocbr35.jpg

WithinReason8mo ago

I can't tell which number is up either since it's on a white background, am I an LLM?

brookst8mo ago

Definitely not a good model for accurately counting limbs on mutant species, then. Might be good at other things that have greater representation in the training set.

user342838mo ago

I'm not knowledgeable about ML but it seems disappointing how we went from "models are able to generalize" and "emergent capabilities" to "can't do anything not greatly represented in the training set".

ComputerGuru8mo ago

I wonder if you used their image editing feature if it would insist on “correcting” the number of limbs even if you asked for unrelated changes.

vunderba8mo ago

It will. I actually made a test when NanoBanana first went GA which featured a photo of a one-legged man and asked the model to change the clothing into pants. It added the pants as requested and then proceeded to "heal" his missing leg in the process.

Very difficult for even SOTA to go against data that is as well-represented as bipedal humanoids.

https://mordenstar.com/blog/edits-with-nanobanana

willahmad8mo ago

China is winning the hearts of developers in this race so far. At least, they won mine already.

cedws8mo ago

Arguably they’ve already won. Check the names at the top the next time you see a paper from an American company, a lot of them are Chinese.

ReverseCold8mo ago

you can’t tell if someone is American or Chinese by looking at their name

I actually claim something even stronger, which is it’s what’s in your heart that really determines if you’re American :)

cedws8mo ago

Cute but the US president is currently on a mass deportation campaign, so it appears what's in peoples' hearts doesn't really matter.

schrectacular8mo ago

The PRC espionage system doesn't care what passport you have or even where you are born. They have a broader and more ethnic-focus definition.

Workaccount28mo ago

They don't have to ever make a profit, so the game they are playing is a bit different.

maxloh8mo ago

OpenAI were not found to be profit-driven too. It is sad to see the place they are now.

pmdr8mo ago

Still nowhere near profits. Until @sama shows third-party audited data, I don't care what he tweets. Same for Anthropic.

vanviegen8mo ago

Of course they do, eventually. Also, it seems like they're not burning nearly as much money as some of their US competitors.

Workaccount28mo ago

AI is part of China's 5 year plan and been given special blessing by Xi Jinping directly.

That pretty much translates to blank checks from the party without much oversight or expected ROI. The approach is basically brute forcing something into existence rather than organically letting it grow. China is notorious for this approach, ghost cities, high speed rail to nowhere, solar panel production in the face of a huge glut.

Ultimately though, there is an expectation that AI will serve the goals of the party, after all it is their trillions that will be funding it. I guess the core difference is that in the US AI is expected to generate profit, in China it is expected to generate controlled social cohesion.

1 more reply

swyx8mo ago

so.. why do you think they are trying this hard to win your heart?

willahmad8mo ago

They might have dozens of reasons, but they already did what they did.

Some of the reasons could be:

- mitigation of US AI supremacy

- Commodify AI use to push forward innovation and sell platforms to run them, e.g. if iPhone wins local intelligence, it benefits China, because China is manufacturing those phones

- talent war inside China

- soften the sentiment against China in the US

- they're just awesome people

- and many more

vanviegen8mo ago

> - they're just awesome people

Thank you for including that option in your list! F#ck cynicism.

michaelt8mo ago

I can see how it would be in China's interest to make sure there was an LLM that produced cutting edge performance in Chinese-language conversations.

And some uses of LLMs are intensely political; think of a student using an LLM to learn about the causes of the civil war. I can understand a country wanting their own LLMs for the same reason they write their own history textbooks.

By releasing the weights they they can get free volunteer help, win hearts and minds with their open approach, weaken foreign corporations, give their citizens robust performance in their native language, and exercise narrative control - all at the same time.

bayarearefugee8mo ago

I don't think they care about winning hearts exactly, but I do think they (correctly) realize that LLM models are racing rapidly toward being commodified and they are still going to be way ahead of us on manufacturing the hardware to run them on.

Watching the US stock market implode from the bubble generated from investors over here not realizing this is happening will be a nice bonus for them, I guess, and constantly shipping open SOTA models will speed that along.

brokencode8mo ago

Maybe they just want to see one of the biggest stock bubble pops of all time in the US.

binary1328mo ago

Surprising this is the first time I’ve seen anyone say this out loud.

1 more reply

protocolture8mo ago

I know I do

llllm8mo ago

they aren’t even trying hard, it’s just that no one else is trying

Keyframe8mo ago

Open source is communism after all? In any case, maybe everyone realized what Zuckerberg was also saying from the start and that is that models will be more of a utility, rather than advantage.

sporkxrocket8mo ago

China has been creating high quality cultural artifacts for thousands of years.

sergiotapia8mo ago

Thank you Qwen team for your generosity. I'm already using their thinking model to build some cool workflows that help boring tasks within my org.

https://openrouter.ai/qwen/qwen3-235b-a22b-thinking-2507

Now with this I will use it to identify and caption meal pictures and user pictures for other workflows. Very cool!

natrysOP8mo ago

Models:

- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking

- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct

causal8mo ago

That has got to be the most benchmarks I've ever seen posted with an announcement. Kudos for not just cherrypicking a favorable set.

esafak8mo ago

We should stop reporting saturated benchmarks.

causal8mo ago

Yeah, especially since many of those are just poor targets after being out so long / contaminating too much.

BUFU8mo ago

The open source models are no longer catching up. They are leading now.

buyucu8mo ago

It has been like that for a while now. At least since Deepseek R1.

vardump8mo ago

So 235B parameter Qwen3-VL is FP16, so practically it requires at least 512 GB RAM to run? Possibly even more for a reasonable context window?

Assuming I don’t want to run it on a CPU, what are my options to run it at home under $10k?

Or if my only option is to run the model with CPU (vs GPU or other specialized HW), what would be the best way to use that 10k? vLLM + Multiple networked (10/25/100Gbit) systems?

loudmax8mo ago

An Apple Mac Studio with 512GB of unified memory is around the $10k. If your really need that much power on your home computer, and you have that much money to spend, this could be the easiest option.

You probably don't need fp16. Most models can be quantized down to q8 with minimal loss of quality. Models can usually be quantized to q4 or even below and run reasonably well, depending on what you expect out of them.

Even at q8, you'll need around 235GB of memory. An Nvidia RTX 5090 has 32GB of VRAM and has an official price of about $2000, but usually retails for more. If you can find them at that price, you'd need eight of them to run a 235GB model entirely in VRAM, and that doesn't include a motherboard and CPU that can handle eight GPUs. You could look for old mining rigs built from RTX 3090s or P40s. Otherwise, I don't see much prospect for fitting this much data into VRAM on consumer GPUs for under $10k.

Without NVLink, you're going to take a massive performance hit running a model distributed over several computers. It can be done, and there's research into optimizing distributed models, but the throughput is a significant bottleneck. For now, you really want to run on a single machine.

You can get pretty good performance out of a CPU. The key is memory bandwidth. Look at server or workstation class CPUs with a lot of DDR5 memory channels that support a high MT/s rate. For example, an AMD Ryzen Threadripper 7965WX has eight DDR5 memory channels at up to 5200 MT/s and retails for about $2500. Depending on your needs, this might give you acceptable performance.

Lastly, I'd question whether you really need to run this at home. Obviously, this depends on your situation and what you need it for. Any investment you put into hardware is going to depreciate significantly in just a few years. $10k of credits in the cloud will take you a long way.

bitflourjikg8mo ago

A non-CPU setup will very likely require an electrical service upgrade or tactical positioning of different systems on different circuits for you to be able to run models that large. Several kW setups also cost non-trivial sums of money to run usually

isoprophlex8mo ago

Extremely impressive, but can one really run these >200B param models on prem in any cost effective way? Even if you get your hands on cards with 80GB ram, you still need to tie them together in a low-latency high-BW manner.

It seems to me that small/medium sized players would still need a third party to get inference going on these frontier-quality models, and we're not in a fully self-owned self-hosted place yet. I'd love to be proven wrong though.

Borealid8mo ago

A Framework Desktop exposes 96GB of RAM for inference and costs a few thou USD.

michaelanckaert8mo ago

You need memory on the GPU, not in the system itself (unless you have unified memory such as the M-architecture). So we're talking about cards like the H200 that have 141GB of memory and cost between 25 to 40k.

Borealid8mo ago

Did you casually glance at how the hardware in the Framework Desktop (Strix Halo) works before commenting?

2 more replies

buyucu8mo ago

I'm running them on GMKTec Evo 2.

vessenes8mo ago

Roughly 1/10 the cost of Opus 4.1, 1/2 the cost of Sonnet 4 on per token inference basis. Impressive. I'd love to see a fast (groq style) version of this served. I wonder if the architecture is amenable.

petesergeant8mo ago

Cerebras are hosting other Qwen models via OpenRouter, so probably

aitchnyu8mo ago

Isnt it a 3x rate difference? 0.7$ for Qwen3-VL vs 3$ for Sonnet 4?

vessenes8mo ago

Openrouter had $8-ish / 1M tokens for Qwen and $15/M for Sonnet 4 when I checked

vessenes8mo ago

I spent a little time with the thinking model today. It's good. It's not better than GPT5 Pro. It might be better than the smallest GPT 5, though.

My current go-to test is to ask the LLM to construct a charging solution for my macbook pro with the model on it, but sadly, I and the pro have been sent to 15th century Florence with no money and no charger. I explain I only have two to three hours of inference time, which can be spread out, but in that time I need to construct a working charge solution.

So far GPT-5 Pro has been by far the best, not just in its electrical specifications (drawings of a commutator), but it generated instructions for jewelers and blacksmith in what it claims is 15th century florentine italian, and furnished a year-by year set of events with trading / banking predictions, a short rundown of how to get to the right folks in the Medici family, .. it was comprehensive.

Generally models suggest building an Alternating current setup and then rectifying to 5V of DC power, and trickle charging over the USB-C pins that allow trickle charging. There's a lot of variation in how they suggest we get to DC power, and often times not a lot of help on key questions, like, say "how do I know I don't have too much voltage using only 15th century tools?"

Qwen 3 VL is a mixed bag. It's the only model other than GPT5 I've talked to that suggested building a voltaic pile, estimated voltage generated by number of plates, gave me some tests to check voltage (lick a lemon, touch your tongue. Mild tingling - good. Strong tingling, remove a few plates), and was overall helpful.

On the other hand, its money making strategy was laughable; predicting Halley's comet, and in exchange demanding a workshop and 20 copper pennies from the Medicis.

Anyway, interesting showing, definitely real, and definitely useful.

ralusek8mo ago

I JUST had a very intense dream that there was a catastrophic event that set humanity back massively, to the point that the internet was nonexistent and our laptops suddenly became priceless. The first thought I had was absolutely hating myself for not bothering to download a local LLM. A local LLM at the level of qwen is enough to massively jump start civilization.

vessenes8mo ago

Yeah bring Qwen and OSS-120b for sure. You’re going to want some solar panels with usb-c output tho

nl8mo ago

> predicting Halley's comet, and in exchange demanding a workshop and 20 copper pennies from the Medicis

I love this! Simple and probably effective (or would get you killed for witchcraft)

vessenes8mo ago

Hard in that you might have to starve for a few decades though. I’d prefer interest rate arb based on competing city state aggression

buu7008mo ago

Funny enough, I did a little bit of ChatGPT-assisted research into a loosely similar scenario not too long ago. LPT: if you happen to know in advance that you'll be in Renaissance Florence, make sure to pack as many synthetic diamonds as you can afford.

ripped_britches8mo ago

That is a freaking insanely cool answer from gpt5

mythz8mo ago

Team Qwen keeps cooking! qwen2.5VL was already my preferred visual model for querying images, will look at upgrading if they release a smaller model we can run locally.

fareesh8mo ago

Can't seem to connect to qwen.ai with DNSSEC enabled

> resolvectl query qwen.ai > qwen.ai: resolve call failed: DNSSEC validation failed: no-signature

And

https://dnsviz.net/d/qwen.ai/dnssec/ shows

aliyunga0019.com/DNSKEY: No response was received from the server over UDP (tried 4 times). See RFC 1035, Sec. 4.2. (8.129.152.246, UDP_-_EDNS0_512_D_KN)

mountainriver8mo ago

Incredible release! Qwen has been leading the open source vision models for a while now. Releasing a really big model is amazing for a lot of use cases.

I would love to see a comparison to the latest GLM model. I would also love to see no one use OS World ever again, it’s a deeply flawed benchmark.

drapado8mo ago

Cool! Pity they are not releasing a smaller A3B MoE model

ilc8mo ago

https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct

daemonologist8mo ago

Their A3B Omni paper mentions that the Omni at that size outperformed the (unreleased I guess) VL. Edit: I see now that there is no Omni-235B-A22B; disregard the following. ~~Which is interesting - I'd have expected the larger model to have more weights to "waste" on additional modalities and thus for the opposite to be true (or for the VL to outperform in both cases, or for both to benefit from knowledge transfer).~~

Relevant comparison is on page 15: https://arxiv.org/abs/2509.17765

jadbox8mo ago

How does it compare to Omni?

ramon1568mo ago

One downside is it has less knowledge of lesser known tools like orpc, which is easily fixed by something like context7

ashvardanian8mo ago

Qwen models have historically been pretty good, but there seems to be no architectural novelty here, if I’m not missing it. Seems like another vision encoder, with a projection, and a large autoregressive model. Have there been any better ideas in the VLM space recently? I’ve been away for a couple of years :(

clueless8mo ago

This demo is crazy: "At what time was the goal scored in this match, who scored it, and how was it scored?"

addandsubtract8mo ago

I had the same reaction, given the 100min+ runtime of the video.

michaelanckaert8mo ago

Qwen has some really great models. I recently used qwen/qwen3-next-80b-a3b-thinking as a drop-in replacement for GPT-4.1-mini in an agent workflow. Cost 4 times less for input tokens and half for output, instant cost savings. As far as I can measure, system output has kept the same quality.

am17an8mo ago

This model is literally amazing. Everyone should try to get their hands on a H100 and just call it a day.

whitehexagon8mo ago

Imagine the demand for a 128GB/256GB/512GB unified memory stuffed hardware linux box shipping with Qwen models already up and running.

Although I´m agAInst steps towards AGI, it feels safer to have these things running locally and disconnected from each other, than some giant GW cloud agentic data centers connected to everyone and everything.

buyucu8mo ago

I bought an GMKtec evo 2 that is a 128 GB unified memory system. Strong recommend.

te00068mo ago

Interesting - do you need to take any special measures to get OSS genAI models to work on this architecture? Can you use inference engines like Ollama and vLLM off-the-shelf (as Docker containers) there, with just the Radeon 8060S GPU? What token rates do you achieve?

(edit: corrected mistake w.r.t. the system's GPU)

buyucu8mo ago

I just use llama.cpp. It worked out of the box.

Keyframe8mo ago

That's AMD Ryzen AI Max+ 395, right? Lots of those boxes popping up recently, but isn't that dog slow? And I can't believe I'm saying this - but maybe RAM filled-up mac might be a better option?

loudmax8mo ago

A GMKtec or a Framework desktop with a Strix Halo/AI Max CPU is about the cheapest way to run a model that needs to fit into about 120GB of memory. Macs have twice the memory bandwidth of these units, so will run significantly faster, but they're also much more expensive. Technically, you could run these models on any desktop PC with 128GB of RAM, but that's a whole different level of "dog slow." It really depends on how much you're prepared to pay to run these bigger models locally.

whitehexagon8mo ago

from the .de website I see 2000eur for the 128GB, But looking at the shipping info, it sounds like it might still be shipped from .cn: ´Please ensure you can handle customs clearance and taxes yourself.´

Also it is Windows 11 which is a big No from me.

But if this is the start of the local big model capable hardware it looks quite hopeful. A 2nd hand M2 128GB studio (which I can use Asahi on) is currently ~3600eur

https://es.wallapop.com/item/mac-studio-m2-ultra-1tb-ssd-128...

ricardobeat8mo ago

Yes, but the mac costs 3-4x more. You can get one of these 395 systems with 96GB for ~1k.

2 more replies

buyucu8mo ago

I'm not buying a Mac. Period.

Alifatisk8mo ago

Wow, the Qwen team doesn't stop and keep coming up with surprises. Not only did they release this but also the new Qwen3-Max model

buyucu8mo ago

The Chinese are great. They are making major contributions to human civilization by open sourcing these models.

youssefarizk8mo ago

Another day another Qwen model

j / k navigate · click thread line to collapse

160 comments

richardlblair8mo ago

Today I tried a handful of the really poor quality invoices and Qwen spat out all the information I needed without an issue. What's crazier is it gave me the bounding boxes to improve tesseract.

iamflimflam18mo ago

I would recommend taking a look at this service: https://learn.microsoft.com/en-us/rest/api/computervision/re...

iamleppert8mo ago

benterix8mo ago

z28mo ago

Mistral's model was terrible when I tested it on non Latin characters and on anything that isn't neat printed text (i.e. handwriting)

VladVladikoff8mo ago

Interesting. I have in the past tried to get bounding boxes of property boundaries on satellite maps estimated by VLLM models but had no success. Do you have any tips on how to improve the results?

richardlblair8mo ago

With Qwen I went as stupid as I could: please provide the bounding box metadata for pytesseract for the above image.

And it spat it out.

VladVladikoff8mo ago

It’s funny that many of us say please. I don’t think it impacts the output, but it also feels wrong without it sometimes.

3 more replies

Workaccount28mo ago

Gemini has purpose post training for bounding boxes if you haven't tried it.

The latest update on Gemini live does real time bounding boxes on objects it's talking about, it's pretty neat.

rsalama28mo ago

shameless plug here for AMD's AI Dev Day - registration is open and they want feedback on what to focus on: https://www.amd.com/en/corporate/events/amd-ai-dev-day.html

mh-8mo ago

Do you have some example images and the prompt you tried?

BOOSTERHIDROGEN8mo ago

also documented stack setup if could.

wiz21c8mo ago

I like to test these models on reading the contents of 80's Apple ][ games screenshots. These are very low resolution, very dense. All (free to use) models struggle on that task...

richardlblair8mo ago

My dataset could be described in a similar way. Very low quality, very odd layouts, information density where it's completely unnecessary.

And these contractors were relatively good operators compared to most.

netdur8mo ago

I’ve tried that too, trying to detect the scan layout to get better OCR, but it didn’t really beat a fine-tuned Qwen 2.5 VLM 7B. I’d say fine-tuning is the way to go

richardlblair8mo ago

What's the cost of the fine-tuned model? If you were attempting to optimize for cost, would it be worth it to detect scan layouts to get better OCR?

Honestly, I'm such a noob in this space. I had 1 project I needed to do, didn't want to do it by hand which would have taken 2 days so I spent 5 trying to get a script to do it for me.

netdur8mo ago

rexreed8mo ago

what fine tuning approach did you use?

netdur8mo ago

just unsloth on colab using A100 and dataset on google drive.

unixhero8mo ago

So where did you load up Qwen and how did you supply the pdf or photo files? I don't know how to use these models, but want to learn

baby_souffle8mo ago

LM Studio[0] is the best "i'm new here and what is this!?" tool for dipping your toes in the water.

If the model supports "vision" or "sound", that tool makes it relatively painless to take your input file + text and feed it to the model.

[0]: https://lmstudio.ai/

dabockster8mo ago

unixhero8mo ago

Thank you! I will give it a try and see if I can get that 4090 working a bit.

Alifatisk8mo ago

You can use their models here chat.qwenlm.ai, its their official website

dabockster8mo ago

captainregex8mo ago

AnythingLLM also good for that GUI experience!

captainregex8mo ago

lofaszvanitt8mo ago

People actually use tesseract? It's one of the worst OCR solutions out there. Forget it.

creativebee8mo ago

Any tipps on getting bounding boxes? The model doesn’t seem to even understand the original size of the image. And even if I provide the dimensions, the positioning is off. :'(

kardianos8mo ago

Wait a moment... It gave you BOUNDING BOXES? That is awesome! That is a missing link I need for models.

pouetpouetpoue8mo ago

i had success with tabula. you may not need ai. but fine if it works too.

re5i5tor8mo ago

I would strongly emphasize:

CV != AI Vision

gpt-4o would breeze through your poor images.

richardlblair8mo ago

It did not, unfortunately. When CV failed gpt-4o failed as well. I even had a list of valid invoice numbers & dates to help the models. Still, most failed.

Construction invoices are not great.

re5i5tor8mo ago

Did you try few-shotting examples when you hit problem cases? In my ziploc case, the model was failing if red sharpie was used vs black. A few shot hint fixed that.

1 more reply

re5i5tor8mo ago

I’m very surprised. Have dealt with some really ugly inputs (handwritten text on full ziploc bags etc., stained torn handwritten recipe cards, etc.) with super good success.

deepdarkforest8mo ago

jychang8mo ago

They still suck at explaining which model they serve is which, though.

They also released today Qwen3-VL Plus [1] today alongside Qwen3-VL 235B [2] and they don't tell us which one is better. Note that Qwen3-VL-Plus is a very different model compared to Qwen-VL-Plus.

Also, qwen-plus-2025-09-11 [3] vs qwen3-235b-a22b-instruct-2507 [4]. What's the difference? Which one is better? Who knows.

You know it's bad when OpenAI has a more clear naming scheme.

[1] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

[2] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

[3] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

[4] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

jwr8mo ago

> They still suck at explaining which model they serve is which, though.

"they" in this sentence probably applies to all "AI" companies.

viraptor8mo ago

Both Deepseek and Claude are exceptions. Simple versions and Sonnet is overall worse but faster than Opus for the same version.

deepdarkforest8mo ago

marci8mo ago

it's sometimes not really a matter of which one is better but which one fits best.

For example many have switched to qwen3 models but some still vastly prefer the reasoning and output of QwQ (a qwen2.5 model).

jychang8mo ago

If the naming is so clear to you, then why don't you explain: for a user who wants to use Qwen3-VL through an API, which one has better performance? Qwen3-VL Plus or Qwen3-VL 235b?

1 more reply

nl8mo ago

> Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive.

This "just" is incorrect.

The Qwen team invented things like DeepStack https://arxiv.org/abs/2406.04334

(Also I hate this "The Chinese" thing. Do we say "The British" if it came from a DeepMind team in the UK? Or what if there are Chinese born US citizens working in Paris for Mistral?

Give credit to the Qwen team rather than a whole country. China has both great labs and mediocre labs, just like the rest of the world.)

viraptor8mo ago

The naming makes some sense here. It's backed by the very Chinese Alibaba and the government directly as well. It's almost a national project.

taneq8mo ago

The Americans do that all the time. :P

Mashimo8mo ago

> Do we say "The British"

Yes.

mamami8mo ago

Yeah it's just weird Orientalism all over again

riku_iki8mo ago

> Also I hate this "The Chinese" thing

to me it was positive assessment, I adore their craftsmanship and persistence in moving forward for long period of time.

mrtesthah8mo ago

It erases the individuals doing the actual research by viewing Chinese people as a monolith.

spaceman_20208mo ago

Interestingly, I've found that models like Kimi K2 spit out more organic, natural-sounding text than American models

Fails on the benchmarks compared to other SOTA models but the real-world experience is different

dabockster8mo ago

> Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency.

Download more ram for this progressive web app.

Buy a Threadripper CPU to run this game that looks worse than the ones you played on the Nintendo Gamecube in the early 2000s.

Generate more electricity (hello Elon Musk).

Y'all remember your algorithms classes from college, right? Why not apply that here? Because China is doing just that, and frankly making us look stupid by comparison.

helloericsf8mo ago

alfiedotwtf8mo ago

Let’s hope they’re allowed in the country and get a visa… it’s 50/50 these days

dazzaji8mo ago

Registration full :-(

be7a8mo ago

ACCount378mo ago

Most multi-modal input implementations suck, and a lot of them suck big time.

Computer08mo ago

I feel like most Open Source releases regardless of size claim to be similar in output quality to SOTA closed source stuff.

Workaccount28mo ago

Sadly it still fails the "extra limb" test.

I have a few images of animals with an extra limb photoshopped onto them. A dog with an leg coming out of it's stomach, or a cat with two front right legs.

Jackson__8mo ago

[0] https://huggingface.co/datasets/allenai/pixmo-clocks

[1] https://files.catbox.moe/ocbr35.jpg

WithinReason8mo ago

I can't tell which number is up either since it's on a white background, am I an LLM?

brookst8mo ago

Definitely not a good model for accurately counting limbs on mutant species, then. Might be good at other things that have greater representation in the training set.

user342838mo ago

ComputerGuru8mo ago

I wonder if you used their image editing feature if it would insist on “correcting” the number of limbs even if you asked for unrelated changes.

vunderba8mo ago

Very difficult for even SOTA to go against data that is as well-represented as bipedal humanoids.

https://mordenstar.com/blog/edits-with-nanobanana

willahmad8mo ago

China is winning the hearts of developers in this race so far. At least, they won mine already.

cedws8mo ago

Arguably they’ve already won. Check the names at the top the next time you see a paper from an American company, a lot of them are Chinese.

ReverseCold8mo ago

you can’t tell if someone is American or Chinese by looking at their name

I actually claim something even stronger, which is it’s what’s in your heart that really determines if you’re American :)

cedws8mo ago

Cute but the US president is currently on a mass deportation campaign, so it appears what's in peoples' hearts doesn't really matter.

schrectacular8mo ago

The PRC espionage system doesn't care what passport you have or even where you are born. They have a broader and more ethnic-focus definition.

Workaccount28mo ago

They don't have to ever make a profit, so the game they are playing is a bit different.

maxloh8mo ago

OpenAI were not found to be profit-driven too. It is sad to see the place they are now.

pmdr8mo ago

Still nowhere near profits. Until @sama shows third-party audited data, I don't care what he tweets. Same for Anthropic.

vanviegen8mo ago

Of course they do, eventually. Also, it seems like they're not burning nearly as much money as some of their US competitors.

Workaccount28mo ago

AI is part of China's 5 year plan and been given special blessing by Xi Jinping directly.

1 more reply

swyx8mo ago

so.. why do you think they are trying this hard to win your heart?

willahmad8mo ago

They might have dozens of reasons, but they already did what they did.

Some of the reasons could be:

- mitigation of US AI supremacy

- Commodify AI use to push forward innovation and sell platforms to run them, e.g. if iPhone wins local intelligence, it benefits China, because China is manufacturing those phones

- talent war inside China

- soften the sentiment against China in the US

- they're just awesome people

- and many more

vanviegen8mo ago

> - they're just awesome people

Thank you for including that option in your list! F#ck cynicism.

michaelt8mo ago

I can see how it would be in China's interest to make sure there was an LLM that produced cutting edge performance in Chinese-language conversations.

bayarearefugee8mo ago

brokencode8mo ago

Maybe they just want to see one of the biggest stock bubble pops of all time in the US.

binary1328mo ago

Surprising this is the first time I’ve seen anyone say this out loud.

1 more reply

protocolture8mo ago

I know I do

llllm8mo ago

they aren’t even trying hard, it’s just that no one else is trying

Keyframe8mo ago

Open source is communism after all? In any case, maybe everyone realized what Zuckerberg was also saying from the start and that is that models will be more of a utility, rather than advantage.

sporkxrocket8mo ago

China has been creating high quality cultural artifacts for thousands of years.

sergiotapia8mo ago

Thank you Qwen team for your generosity. I'm already using their thinking model to build some cool workflows that help boring tasks within my org.

https://openrouter.ai/qwen/qwen3-235b-a22b-thinking-2507

Now with this I will use it to identify and caption meal pictures and user pictures for other workflows. Very cool!

natrysOP8mo ago

Models:

- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking

- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct

causal8mo ago

That has got to be the most benchmarks I've ever seen posted with an announcement. Kudos for not just cherrypicking a favorable set.

esafak8mo ago

We should stop reporting saturated benchmarks.

causal8mo ago

Yeah, especially since many of those are just poor targets after being out so long / contaminating too much.

BUFU8mo ago

The open source models are no longer catching up. They are leading now.

buyucu8mo ago

It has been like that for a while now. At least since Deepseek R1.

vardump8mo ago

So 235B parameter Qwen3-VL is FP16, so practically it requires at least 512 GB RAM to run? Possibly even more for a reasonable context window?

Assuming I don’t want to run it on a CPU, what are my options to run it at home under $10k?

Or if my only option is to run the model with CPU (vs GPU or other specialized HW), what would be the best way to use that 10k? vLLM + Multiple networked (10/25/100Gbit) systems?

loudmax8mo ago

bitflourjikg8mo ago

isoprophlex8mo ago

Borealid8mo ago

A Framework Desktop exposes 96GB of RAM for inference and costs a few thou USD.

michaelanckaert8mo ago

Borealid8mo ago

Did you casually glance at how the hardware in the Framework Desktop (Strix Halo) works before commenting?

2 more replies

buyucu8mo ago

I'm running them on GMKTec Evo 2.

vessenes8mo ago

petesergeant8mo ago

Cerebras are hosting other Qwen models via OpenRouter, so probably

aitchnyu8mo ago

Isnt it a 3x rate difference? 0.7$ for Qwen3-VL vs 3$ for Sonnet 4?

vessenes8mo ago

Openrouter had $8-ish / 1M tokens for Qwen and $15/M for Sonnet 4 when I checked

vessenes8mo ago

I spent a little time with the thinking model today. It's good. It's not better than GPT5 Pro. It might be better than the smallest GPT 5, though.

On the other hand, its money making strategy was laughable; predicting Halley's comet, and in exchange demanding a workshop and 20 copper pennies from the Medicis.

Anyway, interesting showing, definitely real, and definitely useful.

ralusek8mo ago

vessenes8mo ago

Yeah bring Qwen and OSS-120b for sure. You’re going to want some solar panels with usb-c output tho

nl8mo ago

> predicting Halley's comet, and in exchange demanding a workshop and 20 copper pennies from the Medicis

I love this! Simple and probably effective (or would get you killed for witchcraft)

vessenes8mo ago

Hard in that you might have to starve for a few decades though. I’d prefer interest rate arb based on competing city state aggression

buu7008mo ago

ripped_britches8mo ago

That is a freaking insanely cool answer from gpt5

mythz8mo ago

Team Qwen keeps cooking! qwen2.5VL was already my preferred visual model for querying images, will look at upgrading if they release a smaller model we can run locally.

fareesh8mo ago

Can't seem to connect to qwen.ai with DNSSEC enabled

> resolvectl query qwen.ai > qwen.ai: resolve call failed: DNSSEC validation failed: no-signature

And

https://dnsviz.net/d/qwen.ai/dnssec/ shows

aliyunga0019.com/DNSKEY: No response was received from the server over UDP (tried 4 times). See RFC 1035, Sec. 4.2. (8.129.152.246, UDP_-_EDNS0_512_D_KN)

mountainriver8mo ago

Incredible release! Qwen has been leading the open source vision models for a while now. Releasing a really big model is amazing for a lot of use cases.

I would love to see a comparison to the latest GLM model. I would also love to see no one use OS World ever again, it’s a deeply flawed benchmark.

drapado8mo ago

Cool! Pity they are not releasing a smaller A3B MoE model

ilc8mo ago

https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct

daemonologist8mo ago

Relevant comparison is on page 15: https://arxiv.org/abs/2509.17765

jadbox8mo ago

How does it compare to Omni?

ramon1568mo ago

One downside is it has less knowledge of lesser known tools like orpc, which is easily fixed by something like context7

ashvardanian8mo ago

clueless8mo ago

This demo is crazy: "At what time was the goal scored in this match, who scored it, and how was it scored?"

addandsubtract8mo ago

I had the same reaction, given the 100min+ runtime of the video.

michaelanckaert8mo ago

am17an8mo ago

This model is literally amazing. Everyone should try to get their hands on a H100 and just call it a day.

whitehexagon8mo ago

Imagine the demand for a 128GB/256GB/512GB unified memory stuffed hardware linux box shipping with Qwen models already up and running.

buyucu8mo ago

I bought an GMKtec evo 2 that is a 128 GB unified memory system. Strong recommend.

te00068mo ago

(edit: corrected mistake w.r.t. the system's GPU)

buyucu8mo ago

I just use llama.cpp. It worked out of the box.

Keyframe8mo ago

That's AMD Ryzen AI Max+ 395, right? Lots of those boxes popping up recently, but isn't that dog slow? And I can't believe I'm saying this - but maybe RAM filled-up mac might be a better option?

loudmax8mo ago

whitehexagon8mo ago

Also it is Windows 11 which is a big No from me.

But if this is the start of the local big model capable hardware it looks quite hopeful. A 2nd hand M2 128GB studio (which I can use Asahi on) is currently ~3600eur

https://es.wallapop.com/item/mac-studio-m2-ultra-1tb-ssd-128...

ricardobeat8mo ago

Yes, but the mac costs 3-4x more. You can get one of these 395 systems with 96GB for ~1k.

2 more replies

buyucu8mo ago

I'm not buying a Mac. Period.

Alifatisk8mo ago

Wow, the Qwen team doesn't stop and keep coming up with surprises. Not only did they release this but also the new Qwen3-Max model

buyucu8mo ago

The Chinese are great. They are making major contributions to human civilization by open sourcing these models.

youssefarizk8mo ago

Another day another Qwen model

j / k navigate · click thread line to collapse