The desktop RTX 5090 has 1792 GB/s of memory bandwidth, partly due to its 512-bit bus, compared to the DGX Spark's 256-bit bus and 273 GB/s of memory bandwidth.
The RTX 5090 has 32 GB of VRAM vs the 128 GB of "VRAM" in the DGX Spark, which is really unified memory.
Also, the RTX 5090 has 21,760 CUDA cores vs 6,144 in the DGX Spark (about 3.5x as many). And with the much higher bandwidth on the 5090 you have a better shot at keeping them fed. So for embarrassingly parallel workloads the 5090 crushes the Spark.
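To make the bandwidth gap concrete, here's the back-of-the-envelope arithmetic as a rough sketch; the per-pin rates are the commonly quoted figures for GDDR7 on the 5090 and LPDDR5X on the Spark, not measured values:

# Rough check of where the two bandwidth numbers come from (quoted specs, not measurements).
def bandwidth_gb_s(bus_width_bits: int, rate_gbps_per_pin: float) -> float:
    """Peak bandwidth = bus width (bits) x per-pin data rate (Gbit/s) / 8 bits per byte."""
    return bus_width_bits * rate_gbps_per_pin / 8

rtx_5090 = bandwidth_gb_s(512, 28.0)     # 512-bit GDDR7 @ 28 Gbit/s per pin
dgx_spark = bandwidth_gb_s(256, 8.533)   # 256-bit LPDDR5X @ 8533 MT/s

print(f"RTX 5090:  {rtx_5090:.0f} GB/s")          # ~1792 GB/s
print(f"DGX Spark: {dgx_spark:.0f} GB/s")         # ~273 GB/s
print(f"Ratio:     {rtx_5090 / dgx_spark:.1f}x")  # ~6.6x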
So if you need to fit big models into VRAM and don't care about speed too much, because you are, for example, building something on your desktop that'll run on data center hardware in production, the DGX Spark is your answer.
If you need speed and 32G of VRAM is plenty, and you don’t care about modeling network interconnections in production, then the RTX 5090 is what you want.
It isn't, because it's a different architecture than the datacenter hardware. They're both called "Blackwell", but that's a lie[1] and you still need a "real" datacenter Blackwell card for development work. (For example, you can't configure/tune vLLM on Spark and then move it onto a B200 and expect it to work, etc.)
[1] -- https://github.com/NVIDIA/dgx-spark-playbooks/issues/22
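A quick way to see the split (a hedged illustration, assuming a PyTorch-with-CUDA install on each box): the reported compute capability differs between the parts, and arch-specific kernels and tunings are keyed to it, so they don't carry over 1:1.

# Hedged illustration: "Blackwell" resolves to different compute capabilities depending
# on the part -- the datacenter B200-class chips report sm_100, while the Spark and the
# consumer cards report sm_12x -- so arch-specific kernels/tunings don't transfer 1:1.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}")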
The nvfuser code doesn't even call it sm_100 vs. sm_120: NVIDIA's internal nomenclature seems to be 2CTA/1CTA, it's a bin. So there are fewer MMA tilings in the released ISA as of 13.1 / r85 44.
The mnemonic tcgen05.mma doesn't mean anything by itself; it's lowered onto real SASS. FWIW, the people I know doing their own drivers say the whole ISA is there, but it doesn't matter.
The family of mnemonics that hits the "Jensen Keynote" path is roughly here: https://docs.nvidia.com/cuda/parallel-thread-execution/#warp....
10x path is hot today on Thor, Spark, 5090, 6000, and data center.
Getting it to trigger reliably on real tilings?
Well that's the game just now. :)
Edit: https://customer-1qh1li9jygphkssl.cloudflarestream.com/1795a...
https://chipsandcheese.com/p/inside-nvidia-gb10s-memory-subs...
But the nicest addition Dell made in my opinion is the retro 90's UNIX workstation-style wallpaper: https://jasoneckert.github.io/myblog/grace-blackwell/
https://www.fsi-embedded.jp/contents/uploads/2018/11/DELLEMC...
Then there is the networking. While Strix Halo systems come with two USB4 40 Gbit/s ports, it's difficult to
a) connect more than 3 machines with two ports each
b) get more than 23 Gbit/s or so per connection, if you're lucky. Latency will also be in the 0.2 ms range, which leaves room for improvement.
Something like Apple's RDMA via Thunderbolt would be great to have on Strix Halo…
Prompt processing is literally 3x to 4x faster on GPT-OSS-120B once you are a little way into your context window, and it is similarly much faster for image generation or any other AI task.
Plus the Nvidia ecosystem, as others have mentioned.
One discussion with benchmarks: https://www.reddit.com/r/LocalLLaMA/comments/1oonomc/comment...
If all you care about is token generation with a tiny context window, then they are very close, but that’s basically the only time. I studied this problem extensively before deciding what to buy, and I wish Strix Halo had been the better option.
So while I think the Strix Halo is a mostly useless machine for any kind of AI, and I think the Spark is actually useful, I don't think pure inference is a good use case for them.
It probably only makes sense as a dev kit for larger cloud hardware.
I'm trying to better understand the trade-offs, or whether it depends on the workload.
It's not really intended to be a great value box for running LLMs at home. Jeff Geerling talks about this in the article.
If I want to do hobby / amateur AI research, mess around with fine-tuning models, and learn the tooling, I'm better off with the GX10 than with AMD's or Apple's systems.
The Strix Halo machines look nice. I'd like one of those too. Especially if/when they ever get around to getting it into a compelling laptop.
But I ordered the ASUS Ascent GX10 machine (since it was more easily available for me than the other versions of these) because I want to play around with fine-tuning open-weight models, learning the tooling, etc.
That and I like the idea of having a (non-Apple) AArch64 Linux workstation at home.
Now if the courier would just get their shit together and actually deliver the thing...
I can get about 20 tokens per second on the DGX Spark using llama-3.3-70B with no loss in quality compared to the model you were benchmarking:
llama-server \
--model llama-3.3-70b-instruct-ud-q4_k_xl.gguf \
--model-draft llama-3.2-1b-instruct-ud-q8_k_xl.gguf \
--ctx-size 80000 \
--ctx-size-draft 4096 \
--draft-min 1 \
--draft-max 8 \
--draft-p-min 0.65 \
-ngl 999 \
--flash-attn on \
--parallel 1 \
--no-mmap \
--jinja \
--temp 0.0 \
-fit off
Specdec works well for code, so the prompt I used was "Write a React TypeScript demo".

prompt eval time = 313.70 ms / 40 tokens (7.84 ms per token, 127.51 tokens per second)
eval time = 46278.35 ms / 913 tokens (50.69 ms per token, 19.73 tokens per second)
total time = 46592.05 ms / 953 tokens
draft acceptance rate = 0.87616 (757 accepted / 864 generated)
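If it helps, here's a toy sketch of what that acceptance rate means (stand-in functions, not llama.cpp internals): the cheap draft proposes up to --draft-max tokens, the big model verifies them, and only tokens that match what the big model would have produced anyway are kept, so the draft affects speed but not the final tokens.

# Toy sketch of greedy speculative decoding with stand-in "models" (hypothetical
# functions, not llama.cpp internals).
def main_model_next(ctx):
    """Stand-in for the expensive 70B greedy next-token step."""
    return (sum(ctx) * 31 + 7) % 100

def draft_model_next(ctx):
    """Stand-in for the cheap 1B draft; agrees with the main model most of the time."""
    return main_model_next(ctx) if sum(ctx) % 10 else 0

def generate(prompt, n_tokens, draft_max=8):
    out = list(prompt)
    while len(out) < len(prompt) + n_tokens:
        # 1. Draft proposes up to draft_max tokens.
        ctx, draft = list(out), []
        for _ in range(draft_max):
            t = draft_model_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Main model verifies: keep the longest prefix it would have produced itself.
        ctx = list(out)
        for t in draft:
            target = main_model_next(ctx)
            if t == target:
                out.append(t)        # accepted: these are the "free" tokens
            else:
                out.append(target)   # rejected: fall back to the main model's own token
                break
            ctx.append(t)
        else:
            out.append(main_model_next(ctx))  # all accepted: one bonus token from verification
    return out[len(prompt):]

print(generate([1, 2, 3], 16))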
The draft model cannot affect the quality of the output. A good draft model makes token generation faster, and a bad one would slow things down, but the quality will be the same as the main model either way.

I recently needed an LLM to batch-process some queries. I ran an ablation on 20+ models from OpenRouter to find the best one. Guess which ones got 100% accuracy? GPT-5-mini, Grok-4.1-fast and... Llama 4 Scout. For comparison, DeepSeek v3.2 got 90%, and the community darling GLM-4.5-Air got 50%. Even the newest GLM-4.7 only got 70%.
Of course, this is just an anecdotal single datapoint which doesn't mean anything, but it shows that Llama 4 is probably underrated.
And the 2x 200 Gbit/s QSFP ports... why would you stack a bunch of these? Does anybody actually use them in day-to-day work/research?
I liked the idea until the final specs came out.
And for CUDA dev it's not worth the crazy price when you can dev on a cheap RTX and then rent a GH or GB server for a couple of days if you need to check compatibility and scaling.
Sounds interesting; can you suggest any good discussions of this (on the web)?
https://www.dell.com/en-us/shop/desktop-computers/dell-pro-m...
I'm really curious to see how things shift when the M5 Ultra, with "tensor" matmul functionality in the GPU cores, rolls out. That should speed the platform up by multiples.
If you still have the hardware (this and the Mac cluster) can you PLEASE get some advice and run some actually useful benchmarks?
Batching on a single consumer GPU often results in 3-4x the throughput. We have literally no idea what that batching looks like on a $10k+ cluster without dropping the cash ourselves to find out.
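For what it's worth, here's a rough sketch of the kind of batching benchmark I mean (it assumes some OpenAI-compatible server such as llama-server or vLLM is already listening on localhost:8000; the endpoint, model name and prompt are placeholders, not anything the article tested):

# Rough sketch: fire N concurrent completion requests at an OpenAI-compatible server
# and report aggregate generated tokens/s at each concurrency level. Endpoint, model
# name and prompt are placeholders -- adjust for whatever is actually running.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"
PROMPT = "Write a React TypeScript demo"

def one_request(_):
    r = requests.post(URL, json={"model": "default", "prompt": PROMPT,
                                 "max_tokens": 256, "temperature": 0.0}, timeout=600)
    return r.json()["usage"]["completion_tokens"]

for concurrency in (1, 4, 16, 64):
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        total_tokens = sum(pool.map(one_request, range(concurrency)))
    elapsed = time.time() - start
    print(f"concurrency {concurrency:>3}: {total_tokens / elapsed:7.1f} tok/s aggregate")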