>The main thing you can do is support companies and groups who are releasing open source models. They are usually using their own data.
Alternatively, we could build standardized open source training datasets from sources like Wikipedia and other Wikimedia projects, public domain literature, and open courseware. I'm sure there are many other free and legal sources of data like these.