throwaway314155 6mo ago

What’s the distinction between MXFP4 and Q8 exactly?

embedding-shape 6mo ago

It's a different way of doing quantization (https://huggingface.co/docs/transformers/en/quantization/mxf...), but I think the most important thing is that OpenAI delivered their own quantization (the MXFP4 from OpenAI/GPT-OSS on HuggingFace, guaranteed correct), whereas all the Q8 and other quantizations you see floating around are community efforts, with somewhat uneven results depending on who did it.

Concretely, from my testing, both 20B and 120B have a much higher refusal rate with Q8 compared to MXFP4, and lower-quality responses overall. But don't take my word for it: the 20B weights are tiny, and it's relatively effortless to try both versions and compare for yourself.
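
For anyone who wants to see the numeric difference between the two formats, here is a rough Python sketch of one block going through each scheme. It assumes the usual definitions: MXFP4 stores 32 FP4 (E2M1) elements with one shared power-of-two scale, and Q8_0 stores 32 signed 8-bit integers with one shared scale. The function names are made up for illustration, and the real kernels differ in details such as rounding mode and scale storage. It only measures per-weight rounding error on toy data, which is not the same thing as output quality on a model that was post-trained with MXFP4.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # both formats share one scale across a block of 32 weights


def mxfp4_roundtrip(block):
    """Quantize to a power-of-two scale plus E2M1 elements, then decode."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # 2 is the largest E2M1 exponent (6 = 1.5 * 2**2); larger values saturate at 6.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def q8_0_roundtrip(block):
    """Quantize to one float scale plus signed 8-bit integers, then decode."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / 127.0  # simplification: kept as a Python float here
    return [max(-127, min(127, round(x / scale))) * scale for x in block]


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Toy data standing in for one block of weights.
weights = [math.sin(i * 0.37) * 0.05 for i in range(BLOCK)]
err_mxfp4 = max_abs_error(weights, mxfp4_roundtrip(weights))
err_q8 = max_abs_error(weights, q8_0_roundtrip(weights))
print(f"MXFP4 max error: {err_mxfp4:.6f}")
print(f"Q8_0  max error: {err_q8:.6f}")
```

On toy data like this the 8-bit format rounds more finely, which is expected with twice the bits. The point made above is a different one: which quantization the weights were actually produced and validated with.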

throwaway314155 (OP) 6mo ago

Wow, thanks for the info. I'm planning on testing this on my M4 Max w/ 36 GB today.

edit:

So looking here https://ollama.com/library/gpt-oss/tags it seems ollama doesn't even provide the MXFP4 variants, much less hide them.

Is the best way to run these variants via llama.cpp or...?

spullara 6mo ago

On the model description page they claim they support it:

Quantization - MXFP4 format

OpenAI utilizes quantization to reduce the memory footprint of the gpt-oss models. The models are post-trained with quantization of the mixture-of-experts (MoE) weights to MXFP4 format, where the weights are quantized to 4.25 bits per parameter. The MoE weights are responsible for 90+% of the total parameter count, and quantizing these to MXFP4 enables the smaller model to run on systems with as little as 16GB memory, and the larger model to fit on a single 80GB GPU.

Ollama is supporting the MXFP4 format natively without additional quantizations or conversions. New kernels are developed for Ollama’s new engine to support the MXFP4 format.

Ollama collaborated with OpenAI to benchmark against their reference implementations to ensure Ollama’s implementations have the same quality.
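
The "4.25 bits per parameter" figure in the quote can be checked with a bit of arithmetic: 32 four-bit elements plus one 8-bit shared scale per block. A small sketch, assuming those block layouts (and a 16-bit scale per 32 weights for Q8_0). The 20e9 parameter count is just a round number taken from the model name, and 0.9 is the lower bound from the "90+%" in the quote, so treat the result as a ballpark for the weights only, ignoring KV cache and activations.

```python
def bits_per_weight(element_bits, block_size, scale_bits):
    """Average storage per weight for a block format with one shared scale."""
    return (element_bits * block_size + scale_bits) / block_size


MXFP4_BITS = bits_per_weight(4, 32, 8)   # 4-bit elements, 8-bit shared scale
Q8_0_BITS = bits_per_weight(8, 32, 16)   # 8-bit elements, 16-bit shared scale
BF16_BITS = 16.0                         # no blocks, no shared scale


def footprint_gb(total_params, moe_fraction, moe_bits, other_bits):
    """Weight storage in GB when MoE and non-MoE tensors use different formats."""
    moe = total_params * moe_fraction * moe_bits
    other = total_params * (1 - moe_fraction) * other_bits
    return (moe + other) / 8 / 1e9


print(MXFP4_BITS, Q8_0_BITS)
# Assumed inputs: 20e9 parameters, 90% of them MoE weights in MXFP4, the rest in BF16.
print(round(footprint_gb(20e9, 0.9, MXFP4_BITS, BF16_BITS), 1), "GB")
```

With those assumed inputs the estimate lands under 16 GB, which is consistent with the claim in the quote.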

Patrick_Devine 6mo ago

The default ones on Ollama are MXFP4 for the feed-forward network and use BF16 for the attention weights. The default weights for llama.cpp quantize those tensors as q8_0, which is why llama.cpp can eke out a little bit more performance at the cost of worse output. If you are using this for coding, you definitely want better output.

You can use the command `ollama show -v gpt-oss:120b` to see the datatype of each tensor.

ode 6mo ago

LM Studio
