Say I have a bag of dice, one of each of the usual D&D denominations (d4, d6, d8, d10, d12, d20). I draw one at random, ask the models for predictions, and roll it. Model A ignores the information about which one I drew, and predicts the correct marginal distribution of rolls (8.75% chance of rolling a 6). Model B correctly processes the information about which one I drew, and predicts the correct distribution given that information (I drew the d6, so 17% chance of rolling a 6). Both models are well calibrated overall, but Model B assigns higher probability, on average, to the outcomes that actually occur, and I would say it is the better model.
A model should be judged both on how accurately it characterizes its uncertainty and how much evidence it's able to successfully make use of.
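To make the dice example concrete, here is a minimal Python sketch of the arithmetic. The names (`DICE`, `marginal`, `avg_prob_a`, and so on) are my own, and the two scoring measures (average probability assigned to the observed face, and expected log score) are just two reasonable ways of cashing out "higher probabilities on average", not the only ones. With one die of each size, the marginal chance of a 6 comes to 7/80 (8.75%), while 31/240 (about 12.9%) is the marginal chance of each face from 1 to 4.

```python
from fractions import Fraction
from math import log

DICE = [4, 6, 8, 10, 12, 20]      # sides of each die in the bag
FACES = range(1, max(DICE) + 1)   # every face that can possibly come up


def marginal(face):
    """Model A: P(face) when we don't know which die was drawn."""
    return sum(Fraction(1, n) for n in DICE if face <= n) / len(DICE)


# Average probability each model assigns to the face that actually comes up.
# Model A: face k occurs with probability marginal(k) and is assigned marginal(k).
avg_prob_a = sum(marginal(k) ** 2 for k in FACES)
# Model B: die n is drawn with probability 1/6, and every face of it is assigned 1/n.
avg_prob_b = sum(Fraction(1, n) for n in DICE) / len(DICE)

# Expected log score in nats (higher is better).
log_score_a = sum(float(marginal(k)) * log(float(marginal(k))) for k in FACES)
log_score_b = -sum(log(n) for n in DICE) / len(DICE)

print("P(6), die unknown:", marginal(6))
print("average probability on the observed face:", float(avg_prob_a), float(avg_prob_b))
print("expected log score:", log_score_a, log_score_b)
```

Both models are calibrated, but Model B comes out ahead on both measures; the gap in expected log score is how much information knowing the die carries about the roll.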