hnuser123456
7mo ago
A 24GB GPU can run a ~30B-parameter model at 4-bit quantization with roughly 8k-12k tokens of context before VRAM is exhausted.
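For a rough sense of where the memory goes, here is a back-of-the-envelope sketch in Python. Every architecture number in it (layer count, KV heads, head dim, bits per weight) is an illustrative assumption for a generic ~30B dense model, not any particular checkpoint:

    # Rough VRAM budget for local inference: quantized weights + KV cache.
    # All shape numbers below are illustrative assumptions, not a real config.

    def weights_gib(n_params: float, bits_per_weight: float) -> float:
        # Quantized weight storage; ~4.5 bits/weight allows for quant overhead.
        return n_params * bits_per_weight / 8 / 2**30

    def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                     ctx: int, bytes_per_elem: float) -> float:
        # Two cached tensors (K and V) per layer, per token.
        return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

    w = weights_gib(30e9, 4.5)  # ~15.7 GiB of weights at 4-bit
    for ctx in (8_192, 12_288):
        kv = kv_cache_gib(60, 8, 128, ctx, 2)  # fp16 cache, GQA with 8 KV heads
        print(f"{ctx:>6} tokens: {w:.1f} GiB weights + {kv:.1f} GiB KV = {w + kv:.1f} GiB")
    # 8192 tokens: ~17.6 GiB; 12288 tokens: ~18.5 GiB. Activations, the CUDA
    # context, and framework overhead eat most of the remaining headroom.

Under these assumptions the weights alone take ~16 GiB, and the KV cache plus runtime overhead fills the rest of a 24GB card somewhere in the 8k-12k range.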
iamnotagenius
7mo ago
Not quite true; it depends on the number of KV heads. GLM4 32b at IQ4 quant with a Q8-quantized KV cache can run its full context in only 20GiB of VRAM.
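The KV-cache term scales linearly with both the KV-head count and the bytes per cached element, which is the commenter's point. A minimal sketch of that effect; the model shapes here are assumptions for illustration, not GLM4-32b's published config:

    # KV cache = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
    # Shape numbers are illustrative assumptions, not GLM4-32b's real config.
    def kv_gib(layers, kv_heads, head_dim, ctx, bytes_per_elem):
        return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 2**30

    ctx = 32_768  # assumed full context window
    print(f"fp16 cache, 8 KV heads: {kv_gib(60, 8, 128, ctx, 2):.1f} GiB")  # ~7.5
    print(f"Q8 cache,   8 KV heads: {kv_gib(60, 8, 128, ctx, 1):.1f} GiB")  # ~3.8
    print(f"Q8 cache,   2 KV heads: {kv_gib(60, 2, 128, ctx, 1):.1f} GiB")  # ~0.9

A Q8 cache plus aggressive grouped-query attention is how a full-context cache can shrink to a couple of GiB and fit alongside ~16 GiB of IQ4 weights in a 20GiB budget.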