I think you could do it hierarchically, and with redundancy.
You'd figure out a replication strategy based on each host's observed reliability (Lindy effect for how long it's been around, plus uptime %).
It would be less "5 million flaky randoms" and more "5,000 very reliable volunteers".
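Rough sketch of what I mean by the replication math. Everything here is made up for illustration: the 90-day Lindy half-life, the node fields, the loss target.

```python
def effective_reliability(uptime_pct, age_days):
    """Heuristic: discount young nodes (Lindy effect) -- a node that's been
    around for a year is more likely to still be around next year than one
    that joined last week. The 90-day constant is an arbitrary assumption."""
    return uptime_pct * (age_days / (age_days + 90))

def pick_replicas(nodes, target_loss=1e-6):
    """Greedily add the most reliable nodes until the probability that *all*
    chosen replicas are down at once falls below target_loss.
    Assumes failures are independent, which they aren't really."""
    ranked = sorted(
        nodes,
        key=lambda n: effective_reliability(n["uptime"], n["age_days"]),
        reverse=True,
    )
    chosen, p_all_down = [], 1.0
    for node in ranked:
        chosen.append(node)
        p_all_down *= 1.0 - effective_reliability(node["uptime"], node["age_days"])
        if p_all_down < target_loss:
            break
    return chosen

nodes = [
    {"id": "a", "uptime": 0.99, "age_days": 400},
    {"id": "b", "uptime": 0.95, "age_days": 30},
    {"id": "c", "uptime": 0.90, "age_days": 900},
    {"id": "d", "uptime": 0.99, "age_days": 10},
]
print([n["id"] for n in pick_replicas(nodes, target_loss=0.03)])
```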
Though for the crawling layer you can and should absolutely use the 5 million flaky randoms. That's actually the holy grail of crawling: one request per random consumer device, so no single IP ever stands out enough to get rate-limited or blocked.
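The crawling side could be as dumb as a central queue handing each device one URL and shrugging off failures. This is just a sketch: device_pool, fetch_via_device, and the retry budget are all placeholders for whatever RPC you'd actually use to ask a device to fetch a page.

```python
import random
from collections import deque

def crawl(urls, device_pool, fetch_via_device, max_attempts=5):
    """Hand each URL to one random consumer device. If that device flakes,
    put the URL back in the queue and try another -- with millions of
    devices, any single one doing exactly one request is plenty."""
    queue = deque((url, 0) for url in urls)
    results = {}
    while queue:
        url, attempts = queue.popleft()
        device = random.choice(device_pool)
        try:
            results[url] = fetch_via_device(device, url)  # one request, one device
        except Exception:
            if attempts + 1 < max_attempts:
                queue.append((url, attempts + 1))  # retry on a different device
    return results
```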
I think the actual issue wouldn't be technical but the selection: how do you decide what's worth keeping?
You could just do it on a volunteer basis: one volunteer really likes Lizard Facts and signs up to host that. Or you could dynamically generate the "desired semantic subspace" based on search traffic...
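By "desired semantic subspace" I mean something like: embed recent search queries and keep the pages that land near them. A sketch, assuming you already have embeddings for queries and candidate pages from whatever model you'd actually use.

```python
import numpy as np

def keep_score(page_vec, query_vecs):
    """Score a page by its best cosine similarity to any recent search query.
    Pages nobody is searching for score low and fall out of the hosted set."""
    q = np.asarray(query_vecs, dtype=float)
    p = np.asarray(page_vec, dtype=float)
    sims = q @ p / (np.linalg.norm(q, axis=1) * np.linalg.norm(p) + 1e-9)
    return float(sims.max())

def select_corpus(pages, query_vecs, budget):
    """Rank candidate pages (page_id -> embedding) by keep_score and
    host only the top `budget` of them."""
    ranked = sorted(
        pages.items(),
        key=lambda kv: keep_score(kv[1], query_vecs),
        reverse=True,
    )
    return [page_id for page_id, _ in ranked[:budget]]
```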