> I think groq doesn't use quantization, so the gap between your hardware and groq would be even further apart.
To my knowledge this isn't (absolutely) publicly known, but users on /r/LocalLLaMA and elsewhere have provided some pretty clear evidence that Groq is almost certainly serving quantized models, which makes sense given their memory situation...
An entire GroqRack (42U cabinet) has roughly 14GB of on-chip SRAM, which means it likely can't even reasonably hold llama3 8b in BF16/FP16 (~16GB for the weights alone), let alone 70b, Mixtral, etc.
The amount of hardware required to run their public-facing hosted product likely takes up an obscene amount of floor space, even at int4. Their GroqFlow docs describe int8 quantization, and their toolkit leans heavily on ONNX, which has seen tremendous recent work on post-training quantization strategies and precisions.
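The memory argument above is easy to sanity-check with back-of-envelope arithmetic. A quick sketch (the ~14GB-per-rack figure is from this comment; parameter counts and bytes-per-weight are the usual rough numbers, counting weights only and ignoring KV cache and activations):

```python
import math

RACK_SRAM_GB = 14  # approximate on-chip SRAM per GroqRack, per the comment above

models = {"llama3-8b": 8e9, "llama3-70b": 70e9}          # parameter counts
bytes_per_weight = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

for name, params in models.items():
    for prec, b in bytes_per_weight.items():
        gb = params * b / 1e9                  # weight memory in GB
        racks = math.ceil(gb / RACK_SRAM_GB)   # minimum racks to hold the weights
        print(f"{name} @ {prec}: {gb:.0f} GB weights -> >= {racks} rack(s)")
```

Even at int4, a 70b model needs on the order of 35GB for weights alone, i.e. at least 3 racks, and fp16 pushes that to ~10 racks — which is the floor-space problem in a nutshell.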
However, the power efficiency relative to performance is very good, potentially good enough to use very cheap datacenter/co-location space that can't meet the power and (air) cooling densities required by AMD and Nvidia datacenter GPU products.
Interestingly, I have access to a GroqRack system that I'm hoping to spend some time with this week.