Memory bandwidth is the dominant factor in token generation speed. Hardware support for FP8 or FP4 probably does not matter much there: you should be able to do the arithmetic on the CPU in FP32 while storing the weights in memory as FP4/FP8, converting between formats in the CPU's registers (although to be honest, I have not looked into exactly how those conversions would work). That is how llama.cpp supports BF16 on CPUs that have no native BF16 support. Prompt processing would benefit from hardware FP4/FP8 support, since prompt processing is compute bound rather than memory bandwidth bound.
As for how well those CPUs do with LLMs: token generation speed will be close to memory bandwidth divided by model size, in tokens per second. At least, that is what I have learned from local experiments:
https://github.com/ryao/llama3.c
Note that prompt processing is the phase where the LLM is reading the conversation history and token generation is the phase where the LLM is writing a response.
By the way, you can get an Ampere Altra motherboard + CPU for $1,434.99:
https://www.newegg.com/asrock-rack-altrad8ud-1l2t-q64-22-amp...
I would be shocked if you could get any EPYC CPU with similar or better memory bandwidth for anything close to that price. As for Strix Halo, anyone doing local inference would love it if it were priced like a gaming part. Four of them could run Llama 3.1 405B, on paper. I look forward to seeing its pricing.