A 5090 has 32GB of VRAM, enough to hold a 32B model entirely in memory at Q6_K quantization.
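As a sanity check, the weight footprint is easy to estimate from the average bits per weight of the quantization format. A minimal sketch (Q6_K averages roughly 6.56 bits per weight; KV cache and activations add a few GB on top, so treat these numbers as approximate):

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate quantized weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8

# A 32B model at Q6_K (~6.56 bits/weight): about 26 GB of weights,
# leaving ~6 GB of a 5090's 32 GB for KV cache and overhead.
print(round(weight_size_gb(32, 6.56), 1))  # ~26.2
```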
You can run larger models by splitting the model's layers, keeping some in VRAM and offloading the rest to system RAM. That is slower, but still viable.
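Runtimes like llama.cpp expose this split as a GPU-layer count (the `-ngl`/`--n-gpu-layers` flag). A rough sketch of how you might pick that number, using illustrative (assumed) model sizes and the simplifying assumption that layers are equally sized:

```python
def gpu_layer_count(total_layers: int, model_gb: float,
                    vram_budget_gb: float) -> int:
    """How many transformer layers fit in the VRAM budget,
    assuming roughly equal per-layer size (an approximation)."""
    per_layer_gb = model_gb / total_layers
    return min(total_layers, int(vram_budget_gb / per_layer_gb))

# e.g. a hypothetical 80-layer model quantized to ~40 GB, with
# 20 GB of VRAM left after KV cache: put half the layers on the GPU.
print(gpu_layer_count(80, 40.0, 20.0))  # 40
```

In practice you'd also reserve a few GB of the budget for the KV cache, which grows with context length.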
This means that you can run the Qwen3-Coder-30B-A3B model locally on a 4090 or 5090. That model is a Mixture of Experts model with only 3B active parameters per token, so it generates quickly, but all 30B parameters still have to be loaded. At a mid-range quantization the full model fits in the 24GB of a 3090 as well.
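One caveat with Mixture of Experts models: the full parameter set must be resident in memory, and the active-parameter count only reduces compute per token. A quick bits-per-weight estimate (Q4_K_M averages roughly 4.85 bits per weight; approximate, and ignoring KV cache):

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate quantized weight footprint in GB."""
    return params_billions * bits_per_weight / 8

# All 30B parameters at Q4_K_M: ~18 GB, which fits in a 3090's 24 GB.
print(round(weight_size_gb(30, 4.85), 1))  # ~18.2
```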
The Qwen3-Coder-480B-A35B model is a bigger ask: all 480B parameters must be stored, not just the 35B that are active per token, so even heavily quantized it runs to hundreds of GB. With enough system RAM you could keep a slice of the layers in the VRAM of a 4090 or 5090 and offload the rest, but expect it to be slow.
Yes, it will be slower than running it in the cloud. But you can get a long way with a high-end gaming rig.