You'll have to clarify this because I'm not sure what you mean by "real world data". Do you mean e.g. data that is made available after a machine learning system is deployed "live"?
As far as I can tell, nobody really does this kind of "extrinsic" evaluation systematically, first of all because it is very expensive: such "real world data" is unlabelled, and must be labelled before any evaluation can take place.
What's more, the distribution of the "real world data" is very likely to change between deployments of a machine learning system (this is often called dataset shift or concept drift), so an evaluation of a model trained last month may not be valid this month.
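To make the drift point concrete, here is a minimal sketch of one common *unsupervised* check: compare the distribution of a feature at training time against the same feature in live data, which needs no labels at all. The data is synthetic and the threshold is illustrative, not a recommendation.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical feature values as seen when the model was trained...
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
# ...and the same feature a month later, with a shifted mean.
live_feature = rng.normal(loc=0.5, scale=1.0, size=5000)

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
# live distribution no longer matches the training distribution.
stat, p_value = ks_2samp(train_feature, live_feature)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
if p_value < 0.01:
    print("Shift detected: last month's evaluation may no longer hold.")
```

Note that this only tells you the inputs have drifted; it says nothing about whether the model's *accuracy* has dropped, which is exactly the question you can't answer without labelling the live data.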
So this is all basically very expensive in terms of both money and effort (so, money), and in practice nobody does it. Instead, everyone relies on performance measured on their already labelled datasets as an approximation of real-world performance.
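That standard proxy looks something like the following. This is a minimal sketch using scikit-learn; the synthetic dataset and the choice of model are purely illustrative. The key point is in the comment: the held-out split is labelled and cheap, but it is drawn from the same distribution as the training data, which is precisely what live data is not guaranteed to be.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical labelled dataset standing in for "already labelled data".
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# The held-out test split is the usual stand-in for real-world
# performance: labelled and reusable, but drawn from the same
# distribution as the training data, unlike genuinely live data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```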