It's widely recognized that image recognition models typically perform well also on such data. We don't need to quantify that exactly to conclude that many large (in terms of parameters) models generalize quite well to data neither in the training or the test set.
Provided that the model space is large enough to contain both models that generalize well and models that don't (while still fitting the training data), some explanation why we tend to find generalizing models is required.