> what model can i run on 1TB
With 1TB of RAM you can run nearly anything available (405B is essentially the largest at the moment). For reference, Llama 405B at FP8 precision fits on 8x H100, i.e. 640GB of total VRAM. Quantization is a very deep and involved well (far too much for an HN comment).
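The napkin math for why FP8 fits: at low batch sizes the weights dominate memory, at roughly params x bits / 8 bytes, plus headroom for KV cache and activations. A rough sketch (the figures are estimates, not measurements):

```python
# Back-of-the-envelope memory for model weights alone: params * bits / 8.
# Illustrative estimate; real deployments also need KV cache + activations.
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for bits, label in [(16, "FP16/BF16"), (8, "FP8/INT8"), (4, "INT4")]:
    print(f"Llama 405B @ {label}: ~{weight_memory_gb(405, bits):.0f} GB of weights")

# FP8 lands around 405 GB of weights, which is why it fits on
# 8x H100 (640 GB) with headroom for KV cache, while FP16 (~810 GB) doesn't.
```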
I'm aware it "works" but I don't bother with CPU, GGUF, even llama.cpp so I can't really speak to it. They're just not even remotely usable for my applications.
> tokens per second
Sloooowwww. With 405B it could very well be seconds per token, but this is where a lot of system factors come in. You can find benchmarks out there, but you'll see things like a very high-spec AMD EPYC bare-metal system with very fast DDR4/5 and tons of memory channels doing low single-digit tokens per second with a 70B model.
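To see why it's so slow: batch-1 decode is memory-bandwidth bound, since every generated token has to stream all the weights through RAM once, so tok/s is capped at roughly bandwidth / weight bytes. A rough ceiling estimate (the bandwidth number is an assumed spec for a many-channel DDR5 server, not a benchmark):

```python
# Upper bound on batch-1 decode speed for a dense model:
#   tokens/sec <= memory_bandwidth / bytes_read_per_token (~= weight size).
def decode_ceiling_tok_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / weight_gb

# Assumed theoretical peak for a ~12-channel DDR5 EPYC box (illustrative):
epyc_ddr5_gb_s = 460.0

print(f"70B @ FP16 (~140 GB): <= {decode_ceiling_tok_s(140, epyc_ddr5_gb_s):.1f} tok/s")
print(f"405B @ FP8 (~405 GB): <= {decode_ceiling_tok_s(405, epyc_ddr5_gb_s):.1f} tok/s")

# Real systems land well below these theoretical ceilings, hence the
# low single-digit tok/s in 70B CPU benchmarks and ~1 tok/s or worse for 405B.
```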
> I'll get a GPU too but not sure how much VRAM I need for the best value for your buck
Most of my experience is with top-end GPUs, so I can't really speak to this. You may want to pop in at https://www.reddit.com/r/LocalLLaMA/ - there's much more expertise there for this range of hardware (CPU-based and/or more VRAM-limited GPU configs).