Aye, and that's the issue I'm trying to understand. How do I know whether model 1 or model 2 is more "real" or, for lack of a better term, more useful and reflective of reality?
We can focus on a particular philosophical point, like parsimony / Occam's razor, but as far as I can tell that isn't always sufficient.
There should be some way to determine a model's likelihood of structure beyond "trust me, it works!" If there is, I'm trying to understand it!
I just want to second MJ's points here. You have to remember that 1) all models are wrong and 2) it's models all the way down. Your data is a model: it models the real-world distribution, what we might call the target distribution, which is likely intractable and often very different from your data under various conditions. Your metrics are models: obviously so, given the previous point, but less obviously, even with perfect data they are still models. Your metrics all have limitations, and you must be careful to understand clearly what they are measuring, rather than what you think they are measuring. This is an alignment issue, and the vast majority of people do not consider precisely what their metrics mean, relying instead on the general consensus (great ML example: FID does not measure fidelity, it is a distance between distributions. But you shouldn't stop there, that's the start). Metrics get especially fuzzy in higher dimensions, where geometries are highly non-intuitive. It is best to remember that metrics are guides and not targets (Goodhart).
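To make the FID aside concrete: FID fits a Gaussian to each set of Inception features and reports the Fréchet distance between the two fitted Gaussians. Here's a 1-D sketch of that distance with made-up numbers, just to show it's a distance between distributions, not a fidelity score:

```python
import math

def frechet_1d(mu1, var1, mu2, var2):
    """Frechet distance between two 1-D Gaussians:
    (mu1 - mu2)^2 + (sigma1 - sigma2)^2.
    FID computes the multivariate version of this in Inception-feature
    space: a distance between two fitted Gaussians, nothing more."""
    return (mu1 - mu2) ** 2 + (math.sqrt(var1) - math.sqrt(var2)) ** 2

# Matching the first two moments says nothing about individual samples:
# two sample sets can score "close" while the samples themselves are bad.
print(frechet_1d(0.0, 1.0, 0.0, 1.0))  # identical Gaussians -> 0.0
print(frechet_1d(0.0, 1.0, 3.0, 4.0))  # 9 + (1 - 2)^2 -> 10.0
```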
> There should be some way to determine a model's likelihood of structure beyond "trust me, it works!" If there is, I'm trying to understand it!
I mean, we can use likelihood ;) if we model density, of course. But that's not the likelihood that your model is the correct model; it is the likelihood that, given the data you have, your model's parameterization can reasonably model the sampling distribution of the data. These are subtly different, and the difference comes from the point above. And then we gotta know if you're actually operating on the right number of dimensions. Are you approximating PCA like a typical VAE? Is the bottleneck enough for proper parameterization? Is your data in sufficient dimensionality? Does the fucking manifold hypothesis even hold for your data? What about the distribution assumption? IID? And don't get me started on indistinguishability in large causal graphs (references in another comment).
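To illustrate that first distinction with a quick sketch (hypothetical numbers): likelihood scores fit-to-sample, and the MLE fit wins the in-sample comparison by construction, which tells you nothing about whether the model family is right.

```python
import math

def gauss_loglik(data, mu, var):
    """Average log-likelihood of data under N(mu, var)."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
               for x in data) / len(data)

# Hypothetical sample (pretend it came from some unknown process).
data = [0.9, 1.1, 1.0, 0.8, 1.2]

# MLE-fit Gaussian: by definition the best in-sample Gaussian fit.
mu_hat = sum(data) / len(data)
var_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)

print(gauss_loglik(data, mu_hat, var_hat))  # highest in-sample score
print(gauss_loglik(data, 1.0, 0.5))         # some other parameterization

# The first number being bigger says only that the fit to the *sample* is
# better. It says nothing about whether a Gaussian is the right family,
# whether the sample represents the target distribution, or whether the
# dimensionality / IID assumptions hold.
```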
So in practice it is best just to build a model that is robust to your data, while always maintaining suspicion of it. After all, all models are wrong, and you're trying to model the data, not merely to have a model of the data.
Evaluation is fucking hard (it is far too easy to make mistakes)
I'd take a Bayesian approach across an ensemble of models, based on the risk of each being right/wrong.
Consider whether Drug A causes or cures cancer. If there's some circumstantial evidence of it causing cancer at rate X in population Y with risk factors Z -- and otherwise broad circumstantial evidence of it curing at rate A in pop B with features C...
then what? Then create various scenarios under these (likely contradictory) assumptions. Formulate an appropriate risk. Derive some implied policies.
This is the reality of how almost all actual decisions are made in life, and necessarily so.
The real danger is when ML is used to replace that, and you end up with extremely fragile systems that automate actions of unknown risk -- on the basis that they were "99.99% accurate", i.e., considered uncontrolled experimental condition E1 and not conditions E2...E10_000, which actually occur.
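A minimal sketch of that ensemble-plus-risk idea, with made-up beliefs and losses (every name and number here is purely illustrative):

```python
# Hypothetical sketch of decision-making under model uncertainty:
# average each policy's loss over the models, weighted by how much we
# believe each model. All numbers are invented for illustration.

models = {
    # model name: (posterior belief, loss if we approve, loss if we reject)
    "drug_causes_cancer": (0.3, 100.0, 0.0),   # approving is catastrophic
    "drug_cures_cancer":  (0.7, 0.0,   20.0),  # rejecting forgoes a cure
}

def expected_loss(policy):
    """Posterior-weighted loss of a policy over the model ensemble."""
    idx = 1 if policy == "approve" else 2
    return sum(entry[0] * entry[idx] for entry in models.values())

for policy in ("approve", "reject"):
    print(policy, round(expected_loss(policy), 6))
# approve: 0.3 * 100 + 0.7 * 0  = 30.0
# reject:  0.3 * 0   + 0.7 * 20 = 14.0
# Even at 70% belief that the drug cures, the asymmetric risk favors
# rejecting pending experiments: the policy falls out of the risk
# calculation, not out of any single model's "accuracy".
```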
You don't. Given observational data alone, it's typically only possible to determine which d-separation equivalence class you're in. Identifying the exact causal structure requires intervening experimentally.
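To see why: here's a tiny d-separation checker (naive path enumeration, fine only for toy graphs) showing that the chain A -> B -> C and the reversed chain A <- B <- C entail exactly the same conditional independencies, while the collider A -> B <- C does not. Observational independence tests alone cannot split the first two.

```python
def d_separated(edges, x, y, given):
    """Check whether x and y are d-separated by `given` in the DAG
    `edges` (a set of directed (u, v) pairs)."""
    nodes = {n for e in edges for n in e}
    undirected = {n: set() for n in nodes}
    for u, v in edges:
        undirected[u].add(v)
        undirected[v].add(u)

    def descendants(n):
        seen, stack = set(), [n]
        while stack:
            c = stack.pop()
            for u, v in edges:
                if u == c and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    def paths(cur, visited):          # all undirected simple paths x..y
        if cur == y:
            yield visited
            return
        for nxt in undirected[cur]:
            if nxt not in visited:
                yield from paths(nxt, visited + [nxt])

    for p in paths(x, [x]):
        blocked = False
        for a, m, b in zip(p, p[1:], p[2:]):
            is_collider = (a, m) in edges and (b, m) in edges
            if is_collider:
                if m not in given and not descendants(m) & given:
                    blocked = True  # collider, nothing downstream observed
            elif m in given:
                blocked = True      # chain/fork node conditioned on
        if not blocked:
            return False            # found an active path
    return True

chain          = {("A", "B"), ("B", "C")}  # A -> B -> C
reversed_chain = {("B", "A"), ("C", "B")}  # A <- B <- C
collider       = {("A", "B"), ("C", "B")}  # A -> B <- C

# Same equivalence class: identical independence pattern.
print(d_separated(chain, "A", "C", {"B"}),
      d_separated(reversed_chain, "A", "C", {"B"}))  # True True
# The collider is in a different class: conditioning on B opens the path.
print(d_separated(collider, "A", "C", set()),
      d_separated(collider, "A", "C", {"B"}))        # True False
```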
> There should be some way to determine a model's likelihood of structure
Why? If the information isn't there, it isn't there. No technique can change that.
More rigorously: given a graph G for a structural equation model S, construct a DAG G' as follows
- Find a minimal subgraph C_i transitively closed under cycle membership (so a cycle, all the cycles it intersects, all the cycles they intersect, and so on)
- Replace each C_i with a complete graph C'_i on the same number of vertices, preserving outgoing edges.
- Add edges from the parents of any vertices in C_i (if not in C_i themselves) to all vertices in C'_i
- Repeat until acyclic
d-separation in G' then entails independence in S, given reasonable smoothness assumptions whose details I don't remember off the top of my head.
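For what it's worth, the cycle clusters described above (cycles transitively closed under cycle membership) are exactly the non-trivial strongly connected components, so here's a sketch of the collapsing step using Kosaraju's algorithm. One caveat: this collapses each cluster to a single super-node (the standard graph condensation) rather than literally substituting a complete graph, since a literal directed complete subgraph would itself be cyclic; treat it as an approximation of the construction above, not the construction itself.

```python
def condense(edges):
    """Collapse each strongly connected component (the 'cycle clusters'
    above) to one super-node, yielding a DAG. Kosaraju's algorithm."""
    nodes = sorted({n for e in edges for n in e})
    fwd = {n: [] for n in nodes}
    rev = {n: [] for n in nodes}
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    seen, order = set(), []
    def dfs_order(n):                 # first pass: record finish order
        seen.add(n)
        for m in fwd[n]:
            if m not in seen:
                dfs_order(m)
        order.append(n)
    for n in nodes:
        if n not in seen:
            dfs_order(n)

    comp = {}
    def dfs_assign(n, c):             # second pass: label components
        comp[n] = c
        for m in rev[n]:
            if m not in comp:
                dfs_assign(m, c)
    c = 0
    for n in reversed(order):
        if n not in comp:
            dfs_assign(n, c)
            c += 1

    groups = {}
    for n, ci in comp.items():
        groups.setdefault(ci, set()).add(n)
    dag_edges = {(comp[u], comp[v]) for u, v in edges if comp[u] != comp[v]}
    return groups, dag_edges

# A <-> B form a cycle; B -> C leaves it. The cycle collapses to one node.
groups, dag = condense({("A", "B"), ("B", "A"), ("B", "C")})
print(groups)  # component 0 holds the collapsed {A, B} cycle; 1 holds {C}
print(dag)     # one super-edge: {(0, 1)}
```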