Machine Learning papers, for example, used to have a terrible reputation for being inconsistent and impossible to replicate.
That didn't make them (all) fraudulent, because that requires intent to deceive.
So the answer is that we still want to see a lot of the papers we currently see, because knowing the technique helps a lot even without the data. I'd rather have that paper than replicability through dataset openness, so losing some replicability here is fine by me.
Working on the next paper is seen as the better choice.
Moreover, if your code is easy for others to run, then you're likely to be hit with people wanting support, or you even open yourself to the risk of someone finding errors in your code (the survey's result, not my own belief).
There are other issues, of course. Just running the code doesn't mean something is replicable. Science is replicated when studies are repeated independently by many teams.
There are many other failure modes: SOTA-hacking, benchmark gaming, and a lack of rigorous analysis of results, for example. And that's ignoring data leakage and other sillier mistakes (which still happen in published work, even in work published in very good venues).
Authors don't do much of anything to disabuse readers of the idea that they simply got really lucky with their pseudorandom number generators during initialization, shuffling, etc. As long as it beats SOTA, who cares whether it is actually a meaningful improvement? Of course, doing multiple runs with a decent bootstrap to get some estimate of the average behavior is often really expensive and really slow, and deadlines are always tight.

There is also the matter that the field converged on an experimentation methodology that isn't actually correct. Once you start reusing test sets, your experiments stop being approximations of a random sampling process, and you quickly find yourself outside the guarantees provided by statistical theory (a similar sort of mistake to the one scientists in other fields make when interpreting p-values). There be dragons out there: statistical demons might come to eat your heart, or your network could converge to an implementation of nethack.
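For what it's worth, the "multiple runs plus a bootstrap" idea is cheap to sketch. This is a toy illustration, not anyone's actual pipeline: `run_experiment` is a hypothetical stand-in for retraining a model end to end with a different seed, and the noise level is made up. The point is just that you resample the per-seed scores to get a confidence interval on the mean, instead of reporting one (possibly lucky) number.

```python
import random
import statistics

def run_experiment(seed: int) -> float:
    """Toy stand-in for a full training run: accuracy depends on the seed.
    In real life this is where you'd retrain the model from scratch."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.02)  # ~0.80 accuracy plus seed noise

# Step 1: repeat the experiment across several seeds, not just one.
scores = [run_experiment(seed) for seed in range(10)]

# Step 2: bootstrap the run scores to get a 95% CI on the mean accuracy.
boot_rng = random.Random(0)
n_boot = 2000
boot_means = sorted(
    statistics.mean(boot_rng.choices(scores, k=len(scores)))
    for _ in range(n_boot)
)
lo = boot_means[int(0.025 * n_boot)]
hi = boot_means[int(0.975 * n_boot)]

print(f"mean={statistics.mean(scores):.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

If the CI of your method overlaps heavily with the baseline's, "beats SOTA by 0.3 points" stops looking like a result and starts looking like seed luck.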
Scale also plays into that, of course, and use of private data as the other comment mentioned.
Ultimately, Machine Learning research is just too competitive and moves too fast. There are tens of thousands (hundreds of thousands, maybe?) of people all working on closely related problems, all rushing to publish their results before someone else publishes something that overlaps too much with their own work. Nobody is going to be as careful as they should be, because they can't afford to. It's more profitable to carefully find the minimal publishable unit of work and do just that, splitting a result into several small papers you can pump out every few months. The first thing that gets sacrificed in that process is reliability.