Yeah, there will probably have to be some adjustment. In the future, maybe an ML agent will hire people to go find answers to its questions, using us as researchers/Mechanical Turks :-) Quality matters more than quantity for something that’s trying to understand the world well rather than just building a statistical language model. I imagine it will be worth paying for quality when training heavily used models, to avoid using garbage info. You don’t need 30 different superficial product reviews padded with SEO text if you have one that’s very thoroughly researched.
And in the meantime, with ads no longer working, maybe crypto is actually useful for something here - Lightning makes very small transactions possible with basically no fees, and makes it easy to pay for things programmatically. People hate being nickel-and-dimed, but a professional trying to construct an ML model could reasonably budget for usage fees in exchange for fast, unhindered access to quality training data. An agent could even weigh its likelihood of learning something new/accurate against the price proposed by the server, and choose which subsets to pull.
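To make that last bit concrete, here's a rough sketch of how an agent might decide what to buy. Everything in it is made up for illustration: the `Offer` fields, the probability estimates, the "value if useful" figure, and the subset names are all assumptions, and there's no real payment or Lightning code here, just the cost/benefit decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """A hypothetical price quote from a data server for one subset."""
    subset: str
    price_sats: int      # price quoted by the server, in satoshis
    p_novel: float       # agent's own estimate: chance the data is new to it
    p_accurate: float    # agent's own estimate: chance the data is accurate


def expected_value_sats(offer: Offer, value_if_useful_sats: float) -> float:
    """Expected worth of a subset, assuming it only helps if new AND accurate."""
    return offer.p_novel * offer.p_accurate * value_if_useful_sats


def choose_subsets(offers, budget_sats: int, value_if_useful_sats: float):
    """Greedy pick: best expected value per sat first, within the budget."""
    # Skip anything whose expected value does not cover its price.
    worthwhile = [
        o for o in offers
        if expected_value_sats(o, value_if_useful_sats) > o.price_sats
    ]
    # max(..., 1) avoids dividing by zero for free offers.
    worthwhile.sort(
        key=lambda o: expected_value_sats(o, value_if_useful_sats)
        / max(o.price_sats, 1),
        reverse=True,
    )
    chosen, remaining = [], budget_sats
    for o in worthwhile:
        if o.price_sats <= remaining:
            chosen.append(o)
            remaining -= o.price_sats
    return chosen


offers = [
    Offer("in-depth-reviews", price_sats=50, p_novel=0.8, p_accurate=0.9),
    Offer("seo-roundups", price_sats=10, p_novel=0.1, p_accurate=0.5),
    Offer("forum-archive", price_sats=20, p_novel=0.5, p_accurate=0.7),
]
picked = choose_subsets(offers, budget_sats=100, value_if_useful_sats=100)
print([o.subset for o in picked])  # ['forum-archive', 'in-depth-reviews']
```

The cheap SEO roundup gets skipped even though it costs the least, because the agent doesn't expect to learn anything from it. A greedy pick like this isn't optimal in general (it's a knapsack problem), but it shows the shape of the trade-off.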
Just a random idea, but I hope we don’t fight tooth and nail to preserve the trash heap of the internet’s current state.