There's a principle more powerful than the bitter lesson: GIGO (garbage in, garbage out).
Training a model to predict an internet dump can only give you so much.
There's a paper called "Textbooks Are All You Need" where they show that a small model trained on a high-quality, no-nonsense dataset can beat much bigger models at a task like Python coding.