(Such a model/statistical-summary, along with a dictionary, could be used to generate nonsensical texts which have similar patterns in terms of just word lengths.)
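(To make that concrete, here is a rough sketch of one way it could work. I'm assuming the "model" is just a table of how often a word of one length follows a word of another length; the names `fit_length_model` and `generate` are made up for illustration, the input text is assumed non-empty, and punctuation attached to a word counts toward its length.)

```python
import random
from collections import Counter, defaultdict

def fit_length_model(text):
    """Count how often a word of length b directly follows a word of length a."""
    lengths = [len(w) for w in text.split()]
    transitions = defaultdict(Counter)
    for a, b in zip(lengths, lengths[1:]):
        transitions[a][b] += 1
    # The model is only numbers: the first word's length plus the transition counts.
    return lengths[0], transitions

def generate(model, dictionary, n_words, rng):
    """Produce nonsense text whose word lengths follow the fitted counts."""
    start, transitions = model
    by_length = defaultdict(list)
    for word in dictionary:
        by_length[len(word)].append(word)
    length, out = start, []
    for _ in range(n_words):
        # Placeholder word if the dictionary has nothing of this length.
        candidates = by_length.get(length) or ["x" * length]
        out.append(rng.choice(candidates))
        following = transitions.get(length)
        if not following:
            # This length was never followed by anything in the source; restart.
            length = start
        else:
            length = rng.choices(list(following), weights=list(following.values()))[0]
    return " ".join(out)

model = fit_length_model("the cat sat on the mat and ran")
print(generate(model, ["dog", "sun", "up", "it", "tree"], 10, random.Random(0)))
```

Nothing from the source text survives in the model except those counts, which is the point of the question that follows.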
Should the resulting work be protected by copyright? I’m not entirely sure…
I guess one thing is that the specific numbers I obtain by doing this are not a consequence of any creative decision-making on my part, which I think in some jurisdictions (I don’t remember which) plays a role in whether a work is copyrightable. (I will use “copyrightable” as shorthand for “protected by copyright”; I don’t mean to imply a requirement that someone specifically registers for copyright.) (IIRC this is why phone books are copyrightable in some jurisdictions but not others?)
The particular choice of statistical analysis does seem like it may involve creative decision-making, but that would only cover which analysis I describe and how the numbers I publish are to be interpreted, not what the numbers are? (Analogous to the source code of an ML model, as opposed to its parameters.)
Here is another question: suppose there is a method of producing a data artifact which would be genuinely (and economically) useful, and which does not rely on taking in any copyrighted input, but which requires a large (expensive) amount of compute and uses a lot of randomness, so that the result would be different each time it was done. (Suppose also that there isn’t much point in doing it multiple times at the same scale, as having two of this kind of data artifact wouldn’t be much more valuable than having one.)
Should such data artifacts be protected by copyright or something like it?
Well, if copyright requires creative human decision making, then they wouldn’t be.
It seems like it would make sense to want there to be an economic incentive to create such data artifacts at larger sizes (up to a point, of course: only as much as is justified by the value produced by their being available).
If such data artifacts can always be distributed without restriction, then ones that are publicly available would be public goods, and I guess only ones that are kept as trade secrets would be private goods? It seems to me like having some mechanism to incentivize both their creation and their eventual free distribution would be beneficial?
But maybe copyright isn’t the best way to do that? Idk.