Anyway, if you quantize to -1, 0, or +1 with all three values equally likely and then use arithmetic coding, you come out at around 1.58 bits per parameter (log2(3)). By skewing the distribution with forced sparsity to something like 5% -1, 90% 0, 5% +1, you get down to about 0.6 bits per parameter after arithmetic coding.
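The figures above are just the Shannon entropy of each value distribution, which arithmetic coding approaches. A quick sanity check (a sketch of the calculation, not the actual compression pipeline):

```python
import math

def entropy_bits(probs):
    # Shannon entropy in bits per symbol; arithmetic coding
    # gets arbitrarily close to this bound.
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform3 = [1/3, 1/3, 1/3]                 # -1, 0, +1 equally likely
sparse3  = [0.05, 0.90, 0.05]              # forced-sparsity ternary
sparse5  = [0.02, 0.08, 0.80, 0.08, 0.02]  # -2..+2, five levels

print(entropy_bits(uniform3))  # ~1.585
print(entropy_bits(sparse3))   # ~0.569
print(entropy_bits(sparse5))   # ~1.066
```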
I used that on "gpt_neox.layers.*.mlp.dense_h_to_4h.weight" (HuggingFace PyTorch implementation), for example. But other layers need more bits. For example, I could never get gpt_neox.embed_in.weight below a 2% -2, 8% -1, 80% 0, 8% +1, 2% +2 distribution, which comes out at around 1.1 bits per parameter [1]. And layers like gpt_neox.layers.0.attention.query_key_value.weight will drive up your overall bits per parameter because they are very difficult to quantize or sparsify. The 1.5 figure was the average over the entire model; some layers compress better and others worse.
[1] example calculation: https://www.wolframalpha.com/input?i=-%28log2%280.02%29*0.02...
E.g. you quantize a 512-float activation to 512-int8, then do the 512x4096 matmul, GELU, and 4096x512 matmul all in int8, then de-quantize back to float. That means no de-quantization overhead on those 4,194,304 parameters in your dense layers.
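A minimal NumPy sketch of that round trip, with the layer shapes from above. Per-tensor absmax scaling and the tanh-approximate GELU are simplifying assumptions on my part; a real int8 kernel would use per-channel scales and fused de-scaling.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x):
    # Per-tensor absmax scaling to int8 (illustrative assumption).
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

x  = rng.standard_normal(512).astype(np.float32)       # activation
w1 = rng.standard_normal((512, 4096)).astype(np.float32)
w2 = rng.standard_normal((4096, 512)).astype(np.float32)

xq,  sx = quantize(x)
w1q, s1 = quantize(w1)
w2q, s2 = quantize(w2)

# int8 matmul accumulated in int32 (127*127*512 fits comfortably),
# de-scaled to float for the nonlinearity.
h = (xq.astype(np.int32) @ w1q.astype(np.int32)) * (sx * s1)
h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # GELU

hq, sh = quantize(h)
y = (hq.astype(np.int32) @ w2q.astype(np.int32)) * (sh * s2)  # back to 512-float
```

The key point is that the weights stay int8 end to end; only the small activation vector is quantized and de-quantized at the boundaries.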