Ah, I see what you mean: the number of unique examples increases logarithmically with data size, which makes sense. Language, in this case, follows a power law.
I think your argument is that smaller datasets are okay because they contain "most" of what the larger datasets contain. But I think this power law implies the opposite. ML models can often reach 80-90% accuracy on a task. Unfortunately, these models often aren't that useful, because the missing 10% of accuracy matters a lot to users. So what the power law actually implies is that, to get that last 10% of gains, you need something like 10x the amount of data.
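To make that concrete, here's a quick sketch. Assume (hypothetically) that model error follows a power law in dataset size, err(N) = c * N^(-alpha); the constants `c` and `alpha` below are illustrative, not measured values. Inverting the formula shows how fast data requirements blow up as you chase the tail:

```python
def data_needed(target_err, c=1.0, alpha=0.5):
    """Invert err = c * N**(-alpha) to get N = (c / err)**(1 / alpha).

    Illustrative only: c and alpha are made-up constants, not fit to
    any real benchmark.
    """
    return (c / target_err) ** (1 / alpha)

n_90 = data_needed(0.10)  # data to reach 90% accuracy (10% error)
n_99 = data_needed(0.01)  # data to reach 99% accuracy (1% error)

# Cutting error by 10x costs 10**(1/alpha) times more data;
# for alpha = 0.5 that's a 100x multiplier.
print(n_99 / n_90)  # → 100.0
```

The exact multiplier depends entirely on the exponent, but the shape of the curve is the point: each additional slice of accuracy costs multiplicatively more data, not additively more.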