undefined | Better HN

0 pointsohyes2y ago0 comments

What hardware do you have that lets you run 7b and do other stuff at the same time?

0 comments

Pretty much any PC with 16GB+ of fast RAM can do this, any PC with a dGPU can do it well.

Maybe a MacBook Pro. The Apple silicon chops can offload a special AI inference engine, and all ram is accessible by all parts of the chip.

gzer02y ago

An M1 Max with 64GB of RAM allows me to run multiple models simultaneously, on top of stable diffusion generating images non-stop + normal chrome, vscode, etc. Definitely feeling the heat, but it's working. Well worth the investment.

selfhoster112y ago

A 7B model at 8-bit quantization takes up 7 GB of RAM. Less if you use a 6-bit quantization, which is nearly as good. Otherwise it's just a question of having enough system RAM and CPU cores, plus maybe a small discrete GPU.

woadwarrior012y ago

You’ll need a bit more than 7GB (~1 GB or so), even at 8 bit quantization, because of the KV-cache. LLM inference is notoriously inefficient without it, because it’s autoregressive.

kkielhofner2y ago

Some projects such as lmdeploy[0] can quantize the KV cache[1] as well to save some VRAM.

Speaking of lmdeploy, it doesn't seem to be widely known but it also supports quantization with AWQ[2] which appears to be superior to the more widely used GPTQ.

The serving backend is Nvidia Triton Inference Server. Not only is Triton extremely fast and efficient, they have a custom TurboMind backend for Triton. With this lmdeploy delivers the best performance I've seen[3].

On my development workstation with an RTX 4090, llama2-chat-13b, AWQ int4, and KV cache int8:

8 concurrent sessions (batch 1): 580 tokens/s

1 concurrent session (batch 1): 105 tokens/s

This is out of the box, I haven't spent any time further optimizing it.

[0] - https://github.com/InternLM/lmdeploy

[1] - https://github.com/InternLM/lmdeploy/blob/main/docs/en/kv_in...

[2] - https://github.com/InternLM/lmdeploy/tree/main#quantization

[3] - https://github.com/InternLM/lmdeploy/tree/main#performance

selfhoster112y ago

6-bit quantizations are supposed to be nearly equivalent to 8-bit, and that does chop 1.5 GB off the model size. I think a 6-bit model should therefore fit, or if that doesn't, 5-bit medium or 5-bit small surely will.

There is always an option to go down the list of available quantizations notch by notch until you find the largest model that works. llama.cpp offers a lot of flexibility in that regard.

FrozenSynapse2y ago

how's the generation speed on CPU?

selfhoster112y ago

On Ryzen 5600X, 7B and 13B run quite fast. Off the top of my head, pure CPU performance is about 25% slower than with an NVIDIA GPU of some kind. I don't remember the numbers off the top of my head, but the generation speed only starts to get annoying for 33B+ models.

_joel2y ago

If you're willing to sacrifice token/s you can even run these on your phone.

j / k navigate · click thread line to collapse

0 comments

brucethemoose22y ago

Pretty much any PC with 16GB+ of fast RAM can do this, any PC with a dGPU can do it well.

hmottestad2y ago

Maybe a MacBook Pro. The Apple silicon chops can offload a special AI inference engine, and all ram is accessible by all parts of the chip.

gzer02y ago

selfhoster112y ago

woadwarrior012y ago

You’ll need a bit more than 7GB (~1 GB or so), even at 8 bit quantization, because of the KV-cache. LLM inference is notoriously inefficient without it, because it’s autoregressive.

kkielhofner2y ago

Some projects such as lmdeploy[0] can quantize the KV cache[1] as well to save some VRAM.

Speaking of lmdeploy, it doesn't seem to be widely known but it also supports quantization with AWQ[2] which appears to be superior to the more widely used GPTQ.

On my development workstation with an RTX 4090, llama2-chat-13b, AWQ int4, and KV cache int8:

8 concurrent sessions (batch 1): 580 tokens/s

1 concurrent session (batch 1): 105 tokens/s

This is out of the box, I haven't spent any time further optimizing it.

[0] - https://github.com/InternLM/lmdeploy

[1] - https://github.com/InternLM/lmdeploy/blob/main/docs/en/kv_in...

[2] - https://github.com/InternLM/lmdeploy/tree/main#quantization

[3] - https://github.com/InternLM/lmdeploy/tree/main#performance

selfhoster112y ago

There is always an option to go down the list of available quantizations notch by notch until you find the largest model that works. llama.cpp offers a lot of flexibility in that regard.

FrozenSynapse2y ago

how's the generation speed on CPU?

selfhoster112y ago

_joel2y ago

If you're willing to sacrifice token/s you can even run these on your phone.

j / k navigate · click thread line to collapse