The math does check out, though, for supporting large frontier MoE models at similar speeds.
At batch size 1, GPT-OSS-120B has 5.1B active parameters. In FP8, that puts it in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).
DeepSeek V4 Flash has 13B active parameters in mixed FP4/FP8.
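For anyone who wants to redo the arithmetic, here is a rough sketch of the size comparison. It assumes 1 byte per parameter for FP8, 2 for FP16, 0.5 for FP4, and 1 GB = 1e9 bytes, and it only counts active weights (no KV cache, activations, or quantization scales). The helper name `weight_gb` is made up for illustration. Since the FP4/FP8 split for DeepSeek V4 Flash isn't stated here, the sketch only brackets it between the all-FP4 and all-FP8 cases.

```python
def weight_gb(active_params_billions: float, bytes_per_param: float) -> float:
    """Approximate size of the active weights in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return active_params_billions * bytes_per_param


# Assumed storage cost per parameter for each format.
FP16_BYTES = 2.0
FP8_BYTES = 1.0
FP4_BYTES = 0.5

gpt_oss_fp8 = weight_gb(5.1, FP8_BYTES)     # GPT-OSS-120B active params in FP8
dense_2b_fp16 = weight_gb(2.0, FP16_BYTES)  # the 2B model in FP16

# The mixed FP4/FP8 ratio is not given, so only bounds are computed.
v4_flash_low = weight_gb(13.0, FP4_BYTES)   # if every active weight were FP4
v4_flash_high = weight_gb(13.0, FP8_BYTES)  # if every active weight were FP8

print(f"GPT-OSS-120B (FP8):  {gpt_oss_fp8:.1f} GB")
print(f"2B model (FP16):     {dense_2b_fp16:.1f} GB")
print(f"DeepSeek V4 Flash:   {v4_flash_low:.1f} to {v4_flash_high:.1f} GB")
```

At batch size 1, decoding is typically bound by how many weight bytes have to be read per token, which is why these numbers are the ones worth comparing.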
Check out the math at the end of our blog post: https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...