If I'm right about that, then if you're willing to go in for somewhere in the vicinity of $30k (24 of the Max 385 machines), you should be able to achieve ChatGPT-level performance.
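Back-of-envelope on the cluster idea (a rough sketch in Python; the 128 GB-per-unit figure is my assumption and depends on the configuration — everything here except the $30k/24 split is guesswork):

    # Rough math for a 24-unit Ryzen AI Max cluster at ~$30k total.
    units = 24
    total_cost_usd = 30_000
    mem_per_unit_gb = 128  # assumption: top memory config; smaller SKUs exist

    print(f"per-unit budget:  ${total_cost_usd / units:,.0f}")  # $1,250
    print(f"aggregate memory: {units * mem_per_unit_gb} GB")    # 3072 GB
    # ~3 TB of unified memory on paper, but splitting one model across
    # 24 boxes makes the interconnect the real bottleneck, not capacity.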
Even with a cloud-based LLM where the response is pretty snappy, I still find that I wander off and return when I am ready to digest the entire response.
But the real kicker here is the 90 s TTFT: you ask a question and see nothing for a full minute and a half.
This is a good list. I like my Beelink a lot; my Minisforum likes to turn itself off every couple of weeks, and I'm not sure why yet.
https://www.techradar.com/pro/there-are-15-amd-ryzen-ai-max-...
---
Performance is pretty bad (<10 tok/s) and context is quite limited. Still, it's good to see progress.
Prompt size (tokens) | TTFT (s), flash attention disabled | TTFT (s), flash attention enabled
4096                 | 53.7                               | 39.7
8192                 | Out of memory (OOM)                | 90.5
16384                | Out of memory (OOM)                | 239.1
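For intuition, the flash-attention-enabled column implies an effective prefill throughput; a quick sketch (numbers taken straight from the table above):

    # Prefill speed implied by TTFT: prompt_tokens / TTFT.
    rows = [(4096, 39.7), (8192, 90.5), (16384, 239.1)]  # (tokens, TTFT s)
    for tokens, ttft in rows:
        print(f"{tokens:>6} tokens -> {tokens / ttft:5.1f} prefill tok/s")
    # ~103, ~90, ~69 tok/s: throughput drops as the prompt grows, so TTFT
    # scales worse than linearly with context length.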
AFAICT, the answer is "because Minisforum". I don't know if they have a design principle that says to run their systems near the edge of the thermal envelope or what, but Minisforum is the only brand I've had consistent stability trouble with. My last one got to the point where it stopped booting altogether; it just looped. Since then I've written off Minisforum as a brand; it's just not worth the hassle.
Though only 5 GbE? Can't they do USB-C/Thunderbolt 40 Gb/s connections like Macs?
Does the network speed matter that much when TFA talks about outputting a few tens of tokens per second? Ain't 5 Gbit/s plenty for that? (I understand the need to load the model, but that'd be local already, right?)
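Quick sanity check on that (a sketch; the 50 tok/s and per-token byte counts are assumptions, not measurements):

    # How much of a 5 Gbit/s link does streaming LLM output actually use?
    tok_per_s = 50       # assumption: generous for this class of hardware
    payload_bytes = 4    # assumption: ~4 UTF-8 bytes of text per token
    framing_bytes = 200  # assumption: JSON/SSE overhead per streamed chunk

    mbit_per_s = tok_per_s * (payload_bytes + framing_bytes) * 8 / 1e6
    print(f"{mbit_per_s:.2f} Mbit/s of 5000 Mbit/s")  # ~0.08 Mbit/s
    # Token streaming uses a tiny fraction of 5 GbE; faster links only
    # matter for shuttling model weights around.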
How much is one of these gonna run me?
Mine sees more use as a Steam machine, but it can run decently large models. Ollama was trivial to get working, and qwen3-coder-next spits out paragraphs of text/code in seconds. I don't really do anything with that, but it's fun to mess around with. (LLMs are still pretty bad at assembly language.)
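For anyone curious how trivial the Ollama part is: with the daemon running and the model pulled, generation is one HTTP call to the local API. A minimal sketch in Python (the model tag is whatever `ollama list` shows on your machine):

    import requests

    # Ollama serves on localhost:11434 by default; /api/generate is its
    # one-shot completion endpoint.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen3-coder-next",  # substitute your local tag
            "prompt": "Write an x86-64 assembly function that adds two ints.",
            "stream": False,  # one JSON object instead of a token stream
        },
        timeout=300,
    )
    resp.raise_for_status()
    print(resp.json()["response"])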
https://frame.work/products/framework-desktop-mainboard-amd-...