Yes, and I would like to know how similar the dataset(s) were. Suppose the models were trained only on greedy algorithms and I then provided a dynamic programming problem in the test set: (how) would the model solve it?
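To make that gap concrete, here is a toy, purely illustrative example (the coin denominations and amounts are my own hypothetical choice, not anything from the models' actual data): a coin-change instance where the greedy strategy gives a suboptimal answer and dynamic programming is needed. If the training distribution only ever rewarded the greedy pattern, it is not obvious why the DP pattern would fall out of it.

```python
# Illustrative only: a classic case where greedy fails and DP is required.
# Denominations {1, 3, 4} and amount 6 are hypothetical, chosen to expose the gap.

def greedy_coin_count(coins, amount):
    # Always take the largest coin that still fits.
    count = 0
    for c in sorted(coins, reverse=True):
        take = amount // c
        count += take
        amount -= take * c
    return count if amount == 0 else None

def dp_coin_count(coins, amount):
    # Minimum number of coins via dynamic programming over sub-amounts.
    INF = float("inf")
    best = [0] + [INF] * amount
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a and best[a - c] + 1 < best[a]:
                best[a] = best[a - c] + 1
    return best[amount] if best[amount] != INF else None

coins = [1, 3, 4]
print(greedy_coin_count(coins, 6))  # 3 coins (4+1+1) -- suboptimal
print(dp_coin_count(coins, 6))      # 2 coins (3+3)   -- optimal
```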
> And yet, many humans who participate in these contests are unable to do so (although I guess the issue here is that Github is not properly indexed and searchable for humans?).
Indeed, so we don't know what "difficult" means for <human+indexed Github>, and hence we cannot compare it to <model trained on Github>.
My point is, whenever I see a new achievement of deep learning, I have no frame of reference (apart from my personal biases) for how "trivial" or "awesome" it is. I would like to have a quantity that measures this - I call it generalization difficulty.
Otherwise the datasets and models just keep getting larger, and we still have no idea what these models are actually capable of.
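Just to sketch what I mean by a "quantity" (this is a naive stand-in, not a proposal for how such a measure should actually be defined): one crude proxy would be how far each test item sits from its nearest training item in some embedding space. The embedding function and the assumption that distance tracks "difficulty" are both things I'm making up for illustration.

```python
# Toy proxy for "generalization difficulty": mean nearest-neighbour distance
# from test items to the training set in an assumed embedding space.
import numpy as np

def generalization_difficulty(train_emb: np.ndarray, test_emb: np.ndarray) -> float:
    # For each test point, distance to its closest training point; average over the test set.
    dists = np.linalg.norm(test_emb[:, None, :] - train_emb[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean())

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 32))            # stand-in for embedded training problems
in_dist = rng.normal(size=(100, 32))           # test set drawn from the same distribution
shifted = rng.normal(loc=3.0, size=(100, 32))  # test set shifted away from the training data

print(generalization_difficulty(train, in_dist))   # smaller: "easy" generalization
print(generalization_difficulty(train, shifted))   # larger: "harder" generalization
```

Obviously a real measure would need to account for what the model can interpolate versus what it must genuinely extrapolate, but even a crude number like this would be better than comparing benchmark scores with no sense of train/test overlap.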