It seems unreasonable to require releasing the training data just for a model to be called open source, given that the data poses copyright challenges similar to those of game assets.
Of course, this wouldn't make the model reproducible. But that's different from open source.
Imagine if Facebook open-sourced their front-end libraries like React but not the back-end.
Imagine if Twitter or Google didn’t publish their algorithms for how they rank things to display to different people.
You don’t need to imagine. That’s exactly what’s happening! Would you call them open source because their front end is open source? Could you host your own back end on your choice of computers?
No. That’s why I even started https://qbix.com/platform
A better analogy would be some graphics card drivers which ship a massive proprietary GPU firmware blob, and a small(ish) kernel shim to talk with said blob.
Sometimes, though, the software alone can be nearly useless without additional assets that aren't necessarily covered by the code license.
Take Quake: having the engine without the assets is useless if what you wanted was to play Quake the game. Neural nets are another prime example, as you mention. Simulators that rely on measured material property databases for usable results also fall into this category, and so on.
So perhaps what we need are new open source licenses that include the assets needed for the user to be able to reasonably use the program as a whole.