Please check out the recent self-distillation work from MIT-ETH, UCLA and Apple [1],[2],[3],[4],[5].
Given the release timelines, I suspect all the 4.x models after Opus 4 are fine-tuned with self-distillation. The latest paper, by Apple, focuses on code generation using a simple technique, hence the name simple self-distillation (SSD) [4],[5].
I've got a strong feeling that self-distillation is the second-best thing to have happened to LLMs, after the transformer breakthrough.
[1] Self-Distillation Enables Continual Learning [pdf] (25 comments):
https://news.ycombinator.com/item?id=48165265
[2] Self-Distillation Enables Continual Learning:
https://arxiv.org/abs/2601.19897
[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:
https://arxiv.org/abs/2601.18734
[4] Embarrassingly simple self-distillation improves code generation (201 comments):
https://news.ycombinator.com/item?id=47637757
[5] Embarrassingly Simple Self-Distillation Improves Code Generation:
https://arxiv.org/abs/2604.01193