Every small model that has "outperformed" GPT-4 has turned out to be overfit to the benchmark, so I'd say that's the default claim, and the opposite claim is the one we should be skeptical of.
With the exception of task specialization: fine-tuning a small model such as Mistral 7B on a specific set of tasks can outperform GPT-4 on those tasks, with cheaper and faster inference.
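For what it's worth, here's a minimal sketch of what that kind of task-specific fine-tune looks like with LoRA via the Hugging Face transformers/peft stack. The dataset name and hyperparameters are placeholders, not a recipe, and the original comment doesn't specify a method; LoRA is just the common way to make a 7B fine-tune cheap:

```python
# Hedged sketch: LoRA fine-tuning of Mistral 7B on a single task.
# "my_org/my_task_dataset" is a hypothetical dataset with a "text" column.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          TrainingArguments, Trainer,
                          DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto")

# LoRA freezes the base weights and trains small adapter matrices,
# which is what makes fine-tuning a 7B model feasible on one GPU.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

ds = load_dataset("my_org/my_task_dataset", split="train")
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
            batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mistral-7b-task-lora",
                           per_device_train_batch_size=4,
                           gradient_accumulation_steps=4,
                           num_train_epochs=3,
                           learning_rate=2e-4,
                           bf16=True,
                           logging_steps=10),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("mistral-7b-task-lora")
```

The point is that the adapter specializes the model to one narrow distribution, which is exactly why the gains don't transfer to multi-task leaderboards.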
Not on the leaderboards mentioned here. That's my point: you can overfit for specific tasks, but you can't beat GPT-4 on multi-task leaderboards without training on the test data.