We are releasing 2-bit and 4-bit quantized versions of Mixtral, produced with the HQQ method we recently published:
https://mobiusml.github.io/hqq_blog/
https://github.com/mobiusml/hqq
The 2-bit version can run on a 24GB Titan RTX.
In terms of perplexity on the wikitext2 dataset, the results are as follows (memory footprint / perplexity, lower perplexity is better):
- Mixtral: 26 GB / 3.79
- Llama2-70B: 26.37 GB / 4.13
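To give a feel for why low-bit quantization shrinks the memory footprint this much, here is a minimal, self-contained sketch of generic group-wise asymmetric 2-bit quantization. This is an illustration only, not the HQQ algorithm itself: HQQ additionally optimizes the quantization parameters (notably the zero-point) with a half-quadratic solver rather than using the plain min/max fit shown here, and the group size of 64 is an arbitrary choice for the example.

```python
import numpy as np

def quantize_2bit(w, group_size=64):
    # Group-wise asymmetric quantization: each group of `group_size`
    # weights shares one float scale and one float zero-point, and each
    # weight is stored as a 2-bit code (4 levels: 0..3).
    # NOTE: illustrative min/max fit, not HQQ's optimized parameters.
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 3.0      # map the group range onto 4 levels
    scale[scale == 0] = 1.0            # guard against constant groups
    zero = -w_min / scale
    q = np.clip(np.round(w / scale + zero), 0, 3).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    # Reconstruct approximate float weights from the 2-bit codes.
    return (q.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)   # toy weight tensor
q, scale, zero = quantize_2bit(w)
w_hat = dequantize(q, scale, zero).reshape(-1)
err = np.abs(w - w_hat).mean()
```

With the 2-bit codes packed four-to-a-byte plus the per-group scale and zero-point, storage drops to roughly 1/7 of fp16, which is the effect that lets a ~47B-parameter model like Mixtral fit in 24GB of VRAM.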