So if I understand this correctly, the Pile only covers code from 2020 and earlier? If I wanted anything released in the past 3 years, say something in the SOTA AI space (where a month is a lifetime), I would need the scraper again?
I don't follow how this can compare to direct, live, unrestricted access. I suppose this is just my own hatred of Microsoft shining through. Of course we should accept the status quo, because how dare we suggest Microsoft could operate in a manner that is anti-competitive.
For anyone else trying to catch up, just rent a datacenter, write a crawler, and deal with all the intricacies of keeping it in sync in real time. This sounds trivial, simple even.
I wonder why nobody is doing it? Perhaps not everyone has access to petabytes of storage space, unlimited bandwidth, unlimited proxy-jumps, etc.
So the alternative is to buy GitHub?