
0 points · renonce · 2y ago · 0 comments

You can; just wait for a 4-bit quantized version.


4 comments · 1 top-level

tarruda · 2y ago · 3 in thread

I only have an RTX 3070 with 8GB VRAM. It can run quantized 7B models well, but this is 8 x 7B. Maybe an RTX 3090 with 24GB VRAM can do it.

espadrine · 2y ago

Once it is supported in llama.cpp, it will likely run on CPU with enough RAM, especially given that the GGUF mmap code only seems to use RAM for the parts of the weights that get used.
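To illustrate the mmap point above, here is a minimal sketch of reading one slice of a memory-mapped file. It is not llama.cpp's GGUF loader. The file layout (8 equal "expert" blocks) and the helper name `read_slice_via_mmap` are assumptions made up for the example. With a memory map, the OS generally pages in only the regions that are actually touched, which is the behavior the comment is relying on.

```python
import mmap
import os
import tempfile

BLOCK = 1 << 20  # 1 MiB per pretend "expert" block (assumed layout)

def read_slice_via_mmap(path, offset, length):
    """Map the whole file, but only touch [offset, offset + length)."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing a mmap returns bytes; only the touched pages
            # need to be resident in RAM.
            return mm[offset:offset + length]

# Build a stand-in weights file: block i is filled with byte value i.
tmp = tempfile.NamedTemporaryFile(delete=False)
for i in range(8):
    tmp.write(bytes([i]) * BLOCK)
tmp.close()

# Read 16 bytes from the start of hypothetical expert 3.
chunk = read_slice_via_mmap(tmp.name, 3 * BLOCK, 16)
os.remove(tmp.name)
```

Whether untouched experts actually stay out of RAM in practice depends on how often each expert is selected, so treat this as a picture of the mechanism rather than a memory guarantee.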

burke · 2y ago

Napkin math: 7B params x (4/8) bytes per param x 8 experts is 28GB. q4 uses a little more than just 4 bits per param, there's extra overhead for context, and the FFN that selects experts probably adds more on top of that.

It would probably fit in 32GB at 4-bit but probably won’t run with sensible quantization/perf on a 3090/4090 without other tricks like offloading. Depending on how likely the same experts are to be chosen for multiple sequential tokens, offloading experts may be viable.
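The napkin math above can be written out as a small sketch. The 4.5 bits per param figure is an assumed stand-in for "a little more than 4 bits", not a measured value, and the estimate covers weights only (no context or expert-selection overhead). The function name is made up for illustration.

```python
def quantized_weight_gb(params_billions, bits_per_param, n_experts=1):
    """Weight-only memory estimate in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param * n_experts

# The figure from the comment: 7 x (4/8) x 8
naive = quantized_weight_gb(7, 4, n_experts=8)      # 28.0

# Assumption: q4 formats cost roughly 4.5 effective bits per param
padded = quantized_weight_gb(7, 4.5, n_experts=8)   # 31.5

fits_24gb = padded <= 24   # False: too big for a 3090/4090 without offloading
fits_32gb = padded <= 32   # True, before adding context overhead
```

Under these assumptions the weights alone land between 28GB and 32GB, which matches the "probably fits in 32GB, not in 24GB" conclusion.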

brucethemoose2 · 2y ago

It would be very tight. Running 8x7B in 24GB (currently) has more overhead than running a 70B.

It's theoretically doable, with quantization from the recent 2-bit quant paper and a custom implementation (in exllamav2?).

EDIT: Actually the download is much smaller than 8x7B. Not sure how, but it's sized more like a 30B, perfect for a 3090. Very interesting.
