A lot of processes downstream of inference benefit from a minimum of care put into the system design. We're talking 80/20-rule stuff here. It's a simple reorientation relative to a janky argmax classifier, but it results in your assumptions being obeyed broadly, in a max-entropy sense.
The key insight is that any prediction model y = f(x) can equally be framed as an energy-based model E = g(x, y), and that the job of ML is to estimate the joint distribution of x and y with a suitable max-entropy surrogate distribution, then perform MLE on this variational distribution against some training data. All the math in the theory follows from this (perhaps excluding causal stuff, but I'm not familiar enough with those techniques to say for sure). Things get a little more complicated once you consider e.g. autoencoders, but the above still holds.
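To make the reframing concrete, here's a minimal sketch (my own illustration, with made-up logits standing in for some classifier f(x)): define the energy as the negated logit, E(x, y) = -f(x)[y], and the max-entropy surrogate is the Boltzmann/softmax distribution over those energies. The argmax prediction is untouched by the change of view.

```python
import numpy as np

# Hypothetical logits f(x) from some classifier over 3 classes.
logits = np.array([2.0, 0.5, -1.0])

# Energy-based view: E(x, y) = -f(x)[y], so low energy = favored class.
energies = -logits

# Max-entropy surrogate over the energies: p(y | x) ∝ exp(-E(x, y)),
# i.e. the familiar softmax.
p = np.exp(-energies) / np.exp(-energies).sum()

# Same predictor, two framings: the argmax of the logits is the
# argmin of the energies is the mode of the surrogate distribution.
assert np.argmax(logits) == np.argmin(energies) == np.argmax(p)
```

Nothing about the model changed; the EBM view just makes the implied distribution explicit so you can reason about it.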
Obviously, with a poor choice of surrogate distribution, your predictions will be worse on average. Yes, even if you don't care about probabilities and just want max-likelihood predictions: your predictions will still be worse on average. By construction, the analysis proceeds by framing the problem this way and following through. A janky argmax classifier is not exempt from this; it, too, already implies a surrogate distribution, but, statistically speaking, it's probably a pretty bad one. So it makes sense to put in a tiny bit more effort to get much closer to representing the space your data actually lives in.
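A toy demonstration of the point (a sketch of my own, not from any particular library): two models trained identically by MLE on the same XOR-like data, differing only in the surrogate family. The misspecified one is worse even on bare argmax (thresholded) predictions, probabilities aside.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data whose structure is multiplicative: y = 1 iff x1 * x2 > 0.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

def fit_logreg(feats, y, steps=2000, lr=0.5):
    """MLE for a Bernoulli surrogate with logits feats @ w (logistic regression)."""
    w = np.zeros(feats.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-feats @ w))
        w += lr * feats.T @ (y - p) / len(y)  # gradient ascent on log-likelihood
    return w

# Surrogate 1: linear in x plus a bias — a poor match for the data's structure.
F1 = np.c_[X, np.ones(len(X))]
acc1 = np.mean(((F1 @ fit_logreg(F1, y)) > 0) == y)

# Surrogate 2: adds the x1*x2 interaction — matches the generative structure.
F2 = np.c_[X, X[:, 0] * X[:, 1], np.ones(len(X))]
acc2 = np.mean(((F2 @ fit_logreg(F2, y)) > 0) == y)

# Same data, same MLE training; only the surrogate family differs, yet the
# thresholded predictions of the misspecified model hover near chance
# while the well-specified one is nearly perfect.
```

Both models "just predict", but the one whose surrogate family can represent the data wins on raw prediction accuracy too.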
Naturally, you could easily find a janky model that outperforms a relatively unoptimized principled model on a specific use case, and many do get lucky this way. But the principled model has a lot more headroom, specifically in terms of the information it can hold: if the design is more or less correct for the problem specification, then the inductive bias built into the model closely matches the structure of the observed data.