Imagine not going to school and instead learning everything from random blog posts or Reddit comments. You could do it if you read a lot, but it's clearly suboptimal.
That's why OpenAI, and probably every other serious AI company, is investing huge amounts in generating (proprietary) datasets.