It's still a computer program that uses an enormous amount of copyrighted work as its input.
It seems like you could calculate how much data a 5GB model can reproduce to within X% error, and what X% should be for 'visual data'.
I bet it's pretty big.
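
Here's a rough back-of-envelope sketch of that calculation in Python. It only does a counting bound: if reproducing one image within the chosen error takes some number of bytes, a 5GB model can't hold more than 5GB divided by that. The per-image byte budgets below are made-up placeholders standing in for different values of X% (looser tolerance means fewer bytes per image); how X% actually maps to bytes is an assumption I'm not trying to settle here, and the function name is just for illustration.

```python
def max_images_within_budget(model_bytes: int, bytes_per_image: int) -> int:
    """Upper bound on how many images fit in a model of `model_bytes`,
    assuming each image needs `bytes_per_image` to be reproduced
    within the chosen error tolerance (a simple counting argument)."""
    if bytes_per_image <= 0:
        raise ValueError("bytes_per_image must be positive")
    return model_bytes // bytes_per_image


# 5 GB taken from the comment above; decimal gigabytes are an assumption.
MODEL_BYTES = 5 * 10**9

# Hypothetical per-image budgets, NOT measured values. Each one stands in
# for some error tolerance X%: a smaller budget means a looser tolerance.
HYPOTHETICAL_BUDGETS = (1_000, 10_000, 100_000)

for budget in HYPOTHETICAL_BUDGETS:
    bound = max_images_within_budget(MODEL_BYTES, budget)
    print(f"{budget:>7} bytes/image -> at most {bound:,} images")
```

With those placeholder budgets the bound comes out to 5,000,000, 500,000, and 50,000 images respectively. It's an upper limit from raw capacity only; it says nothing about what a real model actually retains.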