"Only" 825 GB actually: https://pile.eleuther.ai/
A not-insignificant fraction of that is definitely copyrighted material, though, which raises some interesting questions when switching to a model of distributing "a smaller trained model plus the original raw training data" (though the team behind GPT-J seem clearly happy to distribute their full dataset anyway, and seem to be flying far enough under the radar not to attract the wrong sort of attention, at least for now).