"Only" 825 GB actually: https://pile.eleuther.ai/
A not-insignificant fraction of that is definitely copyrighted material, though, which raises some interesting questions when switching to a model of distributing "a smaller trained model plus the original raw training data" (though the team behind GPT-J seem clearly happy to distribute their full dataset anyway, and seem to be flying far enough under the radar not to attract the wrong sort of attention, at least for now).