* You'll be running a Q5(ish) quantized model, not the full model
* You're OK with buying used hardware
* You have two separate 120V circuits available to plug it into (I assume you're in the US), or alternatively a single 240V dryer/oven/RV-style plug.
The build would look something like (approximate secondary market prices in parentheses):
* Asrock ROMED8-2T motherboard ($700)
* A used Epyc Rome CPU ($300-$1000 depending on how many cores you want)
* 256GB of DDR4, 8x 32GB modules ($550)
* nvme boot drive ($100)
* Ten RTX 3090 cards ($700 each, $7000 total)
* Two 1500W power supplies. One will power the mobo and four GPUs, and the other will power the remaining six GPUs ($500 total)
* An open frame case, the kind made for crypto miners ($100?)
* PCIe splitters, cables, screws, fans, other misc parts ($500)
Total is about $10k, give or take. You'll be limiting the GPUs (using `nvidia-smi` or similar) to run at 200-225W each, which drastically reduces their top-end power draw for a minimal drop in performance. Plug each power supply into a different AC circuit, or use a dual 120V adapter with a 240V outlet to effectively accomplish the same thing.
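As a sketch, capping the cards looks like this with `nvidia-smi` (note the limit doesn't survive a reboot on its own, so you'd rerun it from a systemd unit or similar):

```shell
# Enable persistence mode so settings stick between CUDA sessions
sudo nvidia-smi -pm 1

# Cap every GPU at 225W (add -i <index> to target a single card)
sudo nvidia-smi -pl 225

# Verify the limit took effect on all ten cards
nvidia-smi --query-gpu=index,power.limit --format=csv
```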
When actively running inference you'll likely be pulling ~2500-2800W from the wall, but at idle, the whole system should use about a tenth of that.
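The rough arithmetic behind that estimate (a sketch; the "everything else" wattage and PSU efficiency figures below are my assumptions, not measurements):

```python
gpus = 10
gpu_cap_w = 225           # per-GPU limit set via nvidia-smi, from above
other_w = 300             # CPU + RAM + NVMe + fans (assumed)
psu_efficiency = 0.92     # typical 80 Plus Gold/Platinum at this load (assumed)

dc_load_w = gpus * gpu_cap_w + other_w   # 2550W at the components
wall_w = dc_load_w / psu_efficiency      # what the outlet actually sees
print(round(wall_w))                     # ~2772W, in the 2500-2800W ballpark
```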
It will heat up the room it's in, especially if you use it frequently, but since it's in an open frame case there are lots of options for cooling.
I realize that this setup is still out of the reach of the "average Joe" but for a dedicated (high-end) hobbyist or someone who wants to build a business, this is a surprisingly reasonable cost.
Edit: the other cool thing is that if you use fast DDR4 and populate all 8 RAM slots as I recommend above, the memory bandwidth of this system is competitive with that of Apple silicon -- 204.8GB/sec with DDR4-3200. Combined with a 32+ core Epyc, you could experiment with running many models completely on the CPU, though Llama 405B will probably still be excruciatingly slow.
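That bandwidth figure falls straight out of the channel math (Epyc Rome has 8 memory channels, which is why populating all 8 slots matters):

```python
channels = 8              # Epyc Rome (SP3) memory channels, one DIMM each
transfer_rate_mts = 3200  # DDR4-3200: megatransfers per second
bytes_per_transfer = 8    # each channel is 64 bits wide

bandwidth_gb_s = channels * transfer_rate_mts * bytes_per_transfer / 1000
print(bandwidth_gb_s)     # 204.8
```

Populate only 4 slots and you halve this, which is why the 8x 32GB config beats 4x 64GB for this workload.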