It does result in significant degradation relative to an unquantized model of the same size, but even with simple llama.cpp K-quantization it's still worth it all the way down to 2-bit. The chart in this llama.cpp PR speaks for itself:
https://github.com/ggerganov/llama.cpp/pull/1684#issue-17396...
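To get an intuition for why low-bit quantization degrades quality without destroying it, here is a minimal numpy sketch of block-wise quantization. It is only illustrative: the block size, per-block scale/min scheme, and rounding are assumptions, not llama.cpp's actual Q2_K layout (the K-quants use super-blocks with their own quantized scales).

```python
import numpy as np

def quantize_blocks(weights: np.ndarray, bits: int = 2, block: int = 32) -> np.ndarray:
    """Quantize each block of `block` weights to 2**bits levels using a
    per-block scale and minimum, then dequantize back to float."""
    levels = (1 << bits) - 1
    w = weights.reshape(-1, block)
    lo = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - lo) / levels
    scale[scale == 0] = 1.0                # avoid divide-by-zero for constant blocks
    q = np.round((w - lo) / scale)         # integer codes in [0, levels]
    return (q * scale + lo).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)
for bits in (8, 4, 2):
    err = np.abs(quantize_blocks(w, bits) - w).mean()
    print(f"{bits}-bit: mean abs reconstruction error {err:.4f}")
```

The error grows as the bit width shrinks, but even at 2 bits each block still keeps its rough shape, which is the tradeoff the PR's perplexity chart quantifies on real models.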