1. Claiming that gpt-4o and gpt-4.5 came from the same training run is ridiculous; gpt-4.5 was not distilled from the same pretrain as 4o.
- Mark Chen has literally said as much publicly: it's a completely different pretrain run.
- And clearly, if OpenAI had had a good big base model before 4.5, they would have released it back in 2024.
"How do you think OAI would be able to serve 4.5 at scale if the model itself was 10x total bigger than everything else?" through pipeline parallelism, not tensor parallelism. Don't need to synchronize an all-reduce across clusters. You lose tons of tokens/sec per user though. That's exactly what we see with gpt-4.5 in real life- slow ~10token/sec inference.
2. 4o was definitely not served fully at 4-bit/6-bit, and even at 4-bit a 1T model wouldn't fit in a Maia cluster with a reasonable KV cache for users. You can't quantize attention down to 4-bit/6-bit; that would give the model brain damage. A production environment would quantize attention down to FP8 at most. Even local home users don't quantize attention down to 4-bit: Unsloth UD Q4 quants usually keep attention at Q8. https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/blob/mai...
blk.0.attn_qkv.weight  [2048, 8192]  Q8_0
Also, Q4/Q6/Q-whatever are GGUF quant formats used by llama.cpp, and nobody in a production environment would be using llama.cpp at all. So saying "Qwhatever" is a clear indicator you have no clue what you're talking about.
Since 4o predates widespread MLA, they're clearly using GQA, so you can estimate the KV-cache size per token from an approximate attention-head count and size. Note that Azure offers 4o with a max context of 128k tokens; that's about 4-8 GB of KV cache per user at full context. Even at 4-bit (and it's not at 4-bit), 4o is 500B at most if you actually want to serve customers! Providers do not do batch=1 inference; that would leave the GPU cores idle while memory bandwidth is saturated. So they have to batch many users onto one machine, with all of their KV caches resident in memory. There's just no way you fit a 1T model with 8+ bit attention and a bunch of users' KV caches into 256GB, even if the FFN were FP4.
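For concreteness, here's the kind of napkin math involved. The config below is hypothetical (loosely borrowed from gpt-oss-120b's published GQA shape); 4o's real architecture isn't public:

    # Napkin math for per-user GQA KV-cache size at full context.
    # Hypothetical config, loosely modeled on gpt-oss-120b (36 layers, 8 KV heads,
    # head_dim 64); 4o's actual shape is unknown.
    n_layers, n_kv_heads, head_dim = 36, 8, 64
    ctx_tokens = 128 * 1024
    for name, bytes_per_elem in [("fp16 KV", 2), ("fp8 KV", 1)]:
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
        print(name, per_token * ctx_tokens / 1e9, "GB per user at 128k context")
    # ~9.7 GB (fp16) or ~4.8 GB (fp8) per user: batch a few dozen users and the
    # KV cache alone runs into the hundreds of GB before storing a single weight.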
3. Microsoft leaked the size of 4o, you know. There are other estimates too, and they all put 4o at around 200B. https://arxiv.org/pdf/2412.19260 or https://epoch.ai/gradient-updates/frontier-language-models-h...
4. "(We could go into a whole separate spiel about quantization +history, but suffice it to say everything in deployment is just Q4 these days, and training is mostly Q8)"
More accurately, most deployments are FP4 for the FFN and still 8-bit or 16-bit for attention. And only the Chinese labs train at FP8. There's very little reason to train at FP8 when your AdamW states and gradients are still in BF16/FP32. And note that even DeepSeek keeps BF16/FP32 AdamW states and gradients.
https://arxiv.org/pdf/2412.19437 That's DeepSeek using an FP8 live weight copy + FP32 master weights + FP32 grads + BF16 moments = 13 bytes per parameter. With BF16 live weights it's 14 bytes per parameter. There's very little reason to use FP8 weights over BF16 weights during training; you don't save that much VRAM/compute, unless you're very desperate like DeepSeek. Most labs now still train at W16A16 and apply QAT rather than training at FP8. Even the Chinese labs do this now: Kimi K2.5 is BF16 native and just quantizes the FFN down to INT4 with QAT. You can tell, because the Kimi K2.5 attention tensors are BF16 and not FP8.
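Spelling out that arithmetic, using the state mix described above:

    # Bytes of training state per parameter for the two weight precisions above:
    # live weights + FP32 master copy + FP32 gradients + BF16 Adam moments (m, v).
    common = 4 + 4 + 2 + 2           # FP32 master, FP32 grad, BF16 m, BF16 v
    fp8_training  = 1 + common       # FP8 live weight copy  -> 13 bytes/param
    bf16_training = 2 + common       # BF16 live weight copy -> 14 bytes/param
    print(fp8_training, bf16_training)   # 13 14 -- about a 7% saving, not a 2x win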
4. "instant tree" "And the thinking tree, where o1 = o3 < 5-thinking < 5.2-thinking." What you're describing is a massive waste of money. Nobody's doing that. Each time you distill a model to a different size, you have to do that separately. That's a waste of compute. Nope, openai just took the same model, and kept on posttraining it more and more, and published some checkpoints. That's what everyone does. The various gpt-4o-2024-05-13 and gpt-4o-2024-08-06 and gpt-4o-2024-11-20 and gpt-5 and gpt-5.1 ... and o1 and o3 and gpt-5-thinking models are NOT different sizes.
Every lab takes a model and iterates on it, training it more and more. Creating a bunch of distills is expensive. Training compute is approximately Compute ≈ 6 × (active params) × (tokens trained). Posttraining is basically just throwing a few more tokens at the model and doing some forward and backward passes. I don't know how many tokens they trained on, but it's somewhere in the 10T to 100T range.
Distillation compute ≈ [2 × (teacher active params) + 6 × (student active params)] × (tokens trained), because you pay teacher forward passes on top of fully training the student. That's way more expensive per token than plain training! You may need fewer tokens, but you don't get the value you think from distills.
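To put numbers on it (the sizes and token budget below are assumptions for illustration, not OpenAI's actual figures):

    # Rough FLOPs comparison of training a small model vs distilling it from a teacher.
    teacher_active = 200e9    # assumed teacher active params
    student_active = 20e9     # assumed student active params
    tokens = 10e12            # assumed token budget
    train_flops   = 6 * student_active * tokens                           # ~1.2e24
    distill_flops = (2 * teacher_active + 6 * student_active) * tokens    # ~5.2e24
    print(distill_flops / train_flops)   # ~4.3x more compute per token

And you pay that premium again for every separate size you distill to.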
Look at Deepseek! Deepseek V3? 671B total parameters checkpoint. R1? 671B total parameters checkpoint. V3 0324? 671B total parameters checkpoint. R1 0528? 671B total parameters checkpoint. V3.1 combined thinking and non-thinking? 671B total parameters checkpoint. V3.1 Terminus? 671B total parameters checkpoint. V3.2? 671B total parameters checkpoint.
6. Sparsity matters. Nobody is currently going below about 1% sparsity.
MoE sparsity is just the ratio of active experts to total experts. Most labs settle on around 8 out of 256 (like DeepSeek, GLM, etc.), i.e. roughly 3% activation. There's plenty of research showing that models break down at too high a sparsity, which is why total params are correlated with active params.
Also, please don't use the word "head" to refer to a MoE expert. The word "head" has a specific meaning in ML and it's not that. It's referring to the component in multi-head attention. That's like using the word "transmission" when talking about a car but not referring to the actual transmission. It's making you look really weird.
Actually, we know what architecture OpenAI was using a few years ago, because OpenAI released it. That was the whole point of gpt-oss. Notably, it uses MXFP4 for the MoE FFN but still BF16 for its GQA attention, and it activates 4 of 128 experts. Yes, even OpenAI decided that staying around that ~3% sparsity range is a good idea. And note that OpenAI clearly did not think quantizing attention is a good idea, even though they applied QAT to produce the MXFP4 FFN.
Basically, you have no clue what you're talking about. You're somehow claiming that OpenAI is doing a ton of distills, one for each of 4o/o1/o3/gpt-5/gpt-5.1, thinking and non-thinking, at different sizes... instead of just taking a model they already have and doing more training and publishing more checkpoints like everyone else. They'd be insane to do that.