Opening data is an invitation to lawsuits. That is why even the most die-hard open source enthusiasts are reluctant to do it. It is also why people train a model and generate data with it, rather than sharing the original datasets.
These datasets are huge, and it's practically impossible to make sure they are free of illegal or embarrassing stuff.