I don't trust papers out of “Top Labs” anymore (opens in new tab)

(old.reddit.com)

94 pointszeebeecee3y ago48 comments

48 comments

Eleuther.ai is just a bunch of random, but smart people without capital who decided on Twitter to recreate GPT-3.

Recently they released GPT-NeoX-20B. They mainly coordinate on Discord. They got compute from some company for free.

https://www.eleuther.ai/

Another group called BigScience got a grant from France to use a public institution supercomputer to train large language model in open. They are 71% done training their 176 billion parameters open-source language model called "BLOOM".

> During one-year, from May 2021 to May 2022, 900 researchers from 60 countries and more than 250 institutions are creating together a very large multilingual neural network language model and a very large multilingual text dataset on the 28 petaflops Jean Zay (IDRIS) supercomputer located near Paris, France.

https://bigscience.huggingface.co/

If there is a will there is a way.

BTW - People close to EleutherAI are looking for people wanting to play around with open-source machine learning for biology.

You just need to start contributing on their Discord: https://twitter.com/nc_znc/status/1530545001557643265

mola3y ago

How is this related? OP was complaining that most of these tons of compute papers don't really show mucjg advance theory wise. They say it's obvious by now that putting more compute would slightly push SOA. The comments there add that these fancy papers are hiding more important work by showing some pretty pictures and pumping the PR machines full power.

Isinlor3y ago

I see there a rant that others have compute and he doesn't.

There is plenty of papers showing advance theory wise.

Some even show that big compute is necessary like "A Universal Law of Robustness via Isoperimetry":

> Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layers neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.

https://arxiv.org/abs/2105.12806

I mean, you can be upset at the universe for the way it is, just what is the point?

Veedrac3y ago

I'm sort of baffled whether people saying this sort of stuff actually read ML papers. Because this is just overtly not true, this idea that the majority of papers do scaling and nothing else. There are tons of papers exploring creative ideas, even the one mentioned in the critique, and even the subset of papers that are primarily about scale typically involve meaningful scientific discovery.

saiojd3y ago

The real ugliness in jealousy comes from how it deceives the self. Consider the last part of this post:

"Is this really what we're comfortable with as a community? A handful of corporations and the occasional university waving their dicks at everyone because they've got the compute to burn and we don't"

I honestly think this kind of comment can only come from a place of jealousy. If someone is willing to spend a lot of money on an experiment, shouldn't you be glad it was done? A scientific field is not an athletic competition, where the rules are picked to measure your "worth as a competitor", and where the playing field has to be fair. The point is to move things forward. Many scientific fields have large, technical hurdles which require expensive equipment. If anything, computer science is a rare niche where it sometimes does not. If you want to build a career in a subfield where compute is important, you should do your best to get access to compute. If you are unable to do so while others are, then you might feel anger, shame, jealousy. But these feelings are really a problem you have with yourself, and not with the field of study.

saltcured3y ago

This is effectively the same wealth disparity topic that exhibits itself in many areas of research and in the economy as a whole. Access to resources is conflated with ability or potential. Left unchecked, this bias naturally concentrates power and creates a moat against newcomers.

You are right that this is not specific to computation, but I think you are begging the question by saying it is "jealousy" and that "the point is to move things forward."

It does not require jealousy to ask, "is this how we want to support research?" The question can just as easily arise from empathy, or even from worry about strategic risk. A winner-takes-all approach may be myopic---by slathering attention on short-term successes, we may be neglecting to invest in the development of competitors who would bring future breakthroughs outside the currently entrenched regime.

saiojd3y ago

That's a very good counterargument. Perhaps the true, underlying problem is the lack of social mobility in research.

But, while I really don't think this is problem is particular to machine learning, this type of sentiment (as described in the OP) is very common in the field. I've seen it a lot on reddit, and even in real life. Why so? Why is this form of inequality so hard to swallow for some ML scientists?

1 more reply

Strilanc3y ago

They explicitly say they trust the results. They're complaining that top labs use lots of compute, so the results aren't relevant to someone who can't. They give an example where a paper used 18K TPU core hours. It's easy to find papers that use millions of core hours.

IMO, asking AI people to not use expensive compute is like asking astronomers to please stop using expensive telescopes. The opposite side of this argument is "Gee, it looks like increasing compute helps AI a lot. Why the heck have we been spending so little on compute?" [0].

[0]: https://www.gwern.net/Scaling-hypothesis

plorkyeran3y ago

We already know that more compute hours give better results, and a paper which simply consists of rerunning previous work but with 100x the compute hours for .03% better results has not discovered anything new, and there's no point in reading that paper.

iakh3y ago

I'm not in this space, but I'd expect that they wouldn't know the results beforehand and just publishing the results even when not a major improvement tells the community that they don't need to spend the extra 99x compute. That seems valuable to some degree. Or is the argument that there wasn't any improvement to be had so why even test the 100x extra?

Veedrac3y ago

That might be a reasonable criticism if it remotely reflected reality, but scaling has repeatedly shown to produce qualitatively stronger models, by large margins, doing things that would seem unimaginable for smaller models.

whiplash4513y ago

We are not at this point yet. Recent work shows that more compute and more data let Transformers beat convnets on computer vision tasks. This is a lot more insightful than « more compute gets you a little further ».

Der_Einzige3y ago

The main problem is that it kills the double blind nature of peer review.

If your paper says you trained on 1000TPUs for weeks, we all know you work at Google brain.

This is subversive for our field. It's really really bad that these authors are virtually guaranteed to be accepted for these reasons alone.

ezoe3y ago

CIFAR-10 is consists of 10,000 test images. So 0.03% of CIFAR-10 is 3 images.

At this tiny number, the randomness is starting to affect the scores. Like labeling mistake of test data by human. Maybe, training SotA with different random seeds make its score 0.03% better or worse.

Hell, 17,810 TPU core-hours is a huge number. You can't ignore the work of randomness. What if a cosmic ray hit a specific memory cell which cause the soft memory error, causing a single wrong calculation which ultimately cause the final trained model 0.03% difference?

So, it's more like: "Jeff Dean spent enough money to feed a family of four for half a decade to get a 0.03% of winning lottery on CIFAR-10."

dekhn3y ago

TPUs make numerical errors more frequently than you'd expect- and it's not cosmic rays, it's QA errors (individual chips were manufactured that passed QA but very occasionally, for specific inputs and operations, produce garbage). When you run on a full pod, many workloads will eventually see corruptions, often in the form of a propagating NaN in critical data like the gradient or weights, that the training cannot recover from.

In fact in a recent big paper from Google they mentioned that training occasionally went wonky in completely nonreproducible ways, but I am pretty sure I know what happened.

togaen3y ago

"Jeff Dean spent enough money to feed a family of four for half a decade to get a 0.03% improvement on CIFAR-10."

Nailed it.

Veedrac3y ago

That quote is disingenuous. Do people really think that...

* Jeff Dean, lead of Google's AI division, wrote a paper with all that complexity to get SOTA on CIFAR-10?

* Jeff Dean, whose salary is sometimes estimated as $3m/y and is responsible for the direction of research of many more, is unreasonable for using <$60k of compute at public pricing, and less than that at internal pricing?

* going from a 0.6% error rate to a 0.57% error rate is reasonably summarized as ‘a 0.03% improvement’, ignoring both that it's a 5% reduction in error and that such improvements get harder as you approach (or exceed) the label accuracy of the dataset?

* the accuracy from this paper came purely from scale?

ezoe3y ago

Still, it's 0.03% difference, or 3 images difference out of 10k images in CIFAR-10. Just 3 images.

Re-training SotA with a different random seed may make its score 0.03% difference. Or there was a wrong calculation in 17,810 TPU core-hours due to faulty hardware or cosmic ray hit which cause the final produce model 0.03% difference.

2 more replies

hervature3y ago

Put another way though, the failure rate was decreased from 0.6 to 0.57 or a 5% reduction. That's pretty significant. If you can reduce LASIK failure rate by 5%, that would provide a ton of value although you would be talking about an absolute improvement of 0.001% in success rate.

I agree that the improvements we are seeing are increasingly due to simply spending more time/money/power but that quip is probably the weakest argument. I would have liked to have seen a Fermi calculation where the power used during training is only 1% (or probably much less) of the total power used. The other thing that reeks naivety is basically the world takes a lot of compute. Much more money and compute is wasted on Candy Crush for instance.

phkahler3y ago

It does seem like better algorithms to get similar results from smaller models should be prioritised.

Rather than throwing more compute at a problem for 0.03 better score, show me one tenth the compute with a loss of 0.03 score. That would be impressive and far more useful.

i_am_proteus3y ago

While I am inclined to personally agree with your sentiment, I don't think I have better insights than Richard Sutton: http://incompleteideas.net/IncIdeas/BitterLesson.html

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."

sdenton43y ago

There's a great motivation for small model work for big-model results: More efficient use of compute can be leveraged to make big models effectively bigger. Small-model architectural innovations are computational leverage. You can even see the convolution operation in this light; it's much more efficient than the 'giant dense matrix' approach.

EfficientNet is an exemplar of this approach; they made much better small models, and wound up with much higher quality big models as a result of having better architecture overall: https://arxiv.org/pdf/1905.11946.pdf

We're currently seeing some great results with more efficient attention layers, which will make the current 'big' models much more efficient... And unlock a next generation of higher quality big models.

mjburgess3y ago

A modern "AI" models have c. 200bn parameters, say. At 32bit/param that's c. 6TB. At 6 bytes/word, 1T words, or more words than are in all books that have ever been written.

NNs, and models of this kind, are just search engines. They store a compression of of everything ever written, and prediction is just googling through it.

Models performance exponential in parameter count should be just ignored by research. This category of performance is already established by research, more compute and more historical data stored, isnt an interesting research result.

Der_Einzige3y ago

The deep connections between compression and prediction are not always obvious to those not in the field.

To illustrate just how much they are the same, here is (at one point SOTA) lossless text compression with GPT-2

https://bellard.org/libnc/gpt2tc.html

mjburgess3y ago

Well.. I think they're over-stated because of the current wave of AI basically only having naïve compression as its tool.

Is the concept `addition` a compression of the space `(Int, Int, Int)` ?

If you want to say it is, OK for some definition of compression. But that compression isnt "mere" in the modern AI sense, it's "exponentially dense".

In that my concept `addition` can generate arbitrarily large amounts of that decompressed space, which is infinite in size.

There's a kind of trick played in the marketing here: since NNs compress, and since learning "can be seen as compression", NNs learn... no, because NNs aren't "exponentially dense", they're "exponentially large" -- I'd claim, the opposite of learning!

dredmorbius3y ago

https://teddit.net/r/MachineLearning/comments/uyratt/d_i_don...

(HN's rewrite rule for Reddit apparently doesn't catch unqualified domain references. Mods have been contacted.)

robertlagrant3y ago

I don't trust CERN-based studies either. Anyone who needs a large hadron collider is just showing off.

viraptor3y ago

CERN is usually running experiments to validate ideas, confirm new hypotheses, in general push the boundaries. They wouldn't just go "let's do the same thing as last time - we've already got the results, but let's crank up the power to get a noise-level improvement in numbers accuracy".

robertlagrant3y ago

I don't think anyone's trying to only get that level of improvement. And the LHC is the definition of cranking up the power.

randomifcpfan3y ago

Jeff Dean has responded on the original Reddit thread. Clarified the experiment’s purpose, results, and pointed out that researchers could conduct the experiment at a much lower cost than the OP’s estimate.

musicale3y ago

Experimental results become credible when they can be reproduced consistently.

Theoretical results become more credible when they are independently verified.

j7ake3y ago

It’s even worse in biology where some labs consistently publish in Nature, Science, Cell. Some of the papers are outright fraudulent. Don’t even trust the numbers.

At least for ML you can mostly reproduce the results, even in if they’re not that interesting.

dekhn3y ago

I wanted to reproduce (actually: use) a paper published in 2021. They provide a notebook, and I went to run it. I can't even get past the first cell (importing torch) because the API has already changed.

So in biology, they write papers that can't be reproduced becaue they're fraud, but in ML, they write papers that can't be reproduced because the setup is too fragile.

ta9883y ago

You shouldn't "trust" papers, stay critical and verify. Wherever they come from. There is a lot of politics, grad students eager to graduate so they cut corners, cheating PIs, cheatings statisticians... (I've witnessed each of these during my career). What you should trust is when things get built upon other works (from other groups) or when it simply gets reproduced. This does not eliminate the risk of fraud or error but greatly reduce it. The same way do not trust claims from companies based on a single paper especially if the company is run by one of the authors. Again it is just my limited experience but most of the ones I have seen were just full of overblown claims and they just hoped they could jump the ship before it got discovered.

avinassh3y ago

Jeff Dean responded to OP:

(The paper mentioned by OP is https://arxiv.org/abs/2205.12755, and I am one of the two authors, along with Andrea Gesmundo, who did the bulk of the work).

The goal of the work was not to get a high quality cifar10 model. Rather, it was to explore a setting where one can dynamically introduce new tasks into a running system and successfully get a high quality model for the new task that reuses representations from the existing model and introduces new parameters somewhat sparingly, while avoiding many of the issues that often plague multi-task systems, such as catastrophic forgetting or negative transfer. The experiments in the paper show that one can introduce tasks dynamically with a stream of 69 distinct tasks from several separate visual task benchmark suites and end up with a multi-task system that can jointly produce high quality solutions for all of these tasks. The resulting model that is sparsely activated for any given task, and the system introduces fewer and fewer new parameters for new tasks the more tasks that the system has already encountered (see figure 2 in the paper). The multi-task system introduces just 1.4% new parameters for incremental tasks at the end of this stream of tasks, and each task activates on average 2.3% of the total parameters of the model. There is considerable sharing of representations across tasks and the evolutionary process helps figure out when that makes sense and when new trainable parameters should be introduced for a new task.

You can see a couple of videos of the dynamic introduction of tasks and how the system responds here:

https://www.youtube.com/watch?v=THyc5lUC_-w

https://www.youtube.com/watch?v=2scExBaHweY

I would also contend that the cost calculations by OP are off and mischaracterize things, given that the experiments were to train a multi-task model that jointly solves 69 tasks, not to train a model for cifar10. From Table 7, the compute used was a mix of TPUv3 cores and TPUv4 cores, so you can't just sum up the number of core hours, since they have different prices. Unless you think there's some particular urgency to train the cifar10+68-other-tasks model right now, this sort of research can very easily be done using preemptible instances, which are $0.97/TPUv4 chip/hour and $0.60/TPUv3 chip/hour (not the "you'd have to use on-demand pricing of $3.22/hour" cited by OP). With these assumptions, the public Cloud cost of the computation described in Table 7 in the paper is more like $13,960 (using the preemptible prices for 12861 TPUv4 chip hours and 2474.5 TPUv3 chip hours), or about $202 / task.

I think that having sparsely-activated models is important, and that being able to introduce new tasks dynamically into an existing system that can share representations (when appropriate) and avoid catastrophic forgetting is at least worth exploring. The system also has the nice property that new tasks can be automatically incorporated into the system without deciding how to do so (that's what the evolutionary search process does), which seems a useful property for a continual learning system. Others are of course free to disagree that any of this is interesting.

Edit: I should also point out that the code for the paper has been open-sourced at: https://github.com/google-research/google-research/tree/mast...

We will be releasing the checkpoint from the experiments described in the paper soon (just waiting on two people to flip approval bits, and process for this was started before the reddit post by OP).

---

source: https://old.reddit.com/r/MachineLearning/comments/uyratt/d_i...

j / k navigate · click thread line to collapse

48 comments

Isinlor3y ago

Eleuther.ai is just a bunch of random, but smart people without capital who decided on Twitter to recreate GPT-3.

Recently they released GPT-NeoX-20B. They mainly coordinate on Discord. They got compute from some company for free.

https://www.eleuther.ai/

https://bigscience.huggingface.co/

If there is a will there is a way.

BTW - People close to EleutherAI are looking for people wanting to play around with open-source machine learning for biology.

You just need to start contributing on their Discord: https://twitter.com/nc_znc/status/1530545001557643265

mola3y ago

Isinlor3y ago

I see there a rant that others have compute and he doesn't.

There is plenty of papers showing advance theory wise.

Some even show that big compute is necessary like "A Universal Law of Robustness via Isoperimetry":

https://arxiv.org/abs/2105.12806

I mean, you can be upset at the universe for the way it is, just what is the point?

Veedrac3y ago

saiojd3y ago

The real ugliness in jealousy comes from how it deceives the self. Consider the last part of this post:

saltcured3y ago

You are right that this is not specific to computation, but I think you are begging the question by saying it is "jealousy" and that "the point is to move things forward."

saiojd3y ago

That's a very good counterargument. Perhaps the true, underlying problem is the lack of social mobility in research.

1 more reply

Strilanc3y ago

[0]: https://www.gwern.net/Scaling-hypothesis

plorkyeran3y ago

iakh3y ago

Veedrac3y ago

whiplash4513y ago

Der_Einzige3y ago

The main problem is that it kills the double blind nature of peer review.

If your paper says you trained on 1000TPUs for weeks, we all know you work at Google brain.

This is subversive for our field. It's really really bad that these authors are virtually guaranteed to be accepted for these reasons alone.

ezoe3y ago

CIFAR-10 is consists of 10,000 test images. So 0.03% of CIFAR-10 is 3 images.

So, it's more like: "Jeff Dean spent enough money to feed a family of four for half a decade to get a 0.03% of winning lottery on CIFAR-10."

dekhn3y ago

In fact in a recent big paper from Google they mentioned that training occasionally went wonky in completely nonreproducible ways, but I am pretty sure I know what happened.

togaen3y ago

"Jeff Dean spent enough money to feed a family of four for half a decade to get a 0.03% improvement on CIFAR-10."

Nailed it.

Veedrac3y ago

That quote is disingenuous. Do people really think that...

* Jeff Dean, lead of Google's AI division, wrote a paper with all that complexity to get SOTA on CIFAR-10?

* the accuracy from this paper came purely from scale?

ezoe3y ago

Still, it's 0.03% difference, or 3 images difference out of 10k images in CIFAR-10. Just 3 images.

2 more replies

hervature3y ago

phkahler3y ago

It does seem like better algorithms to get similar results from smaller models should be prioritised.

Rather than throwing more compute at a problem for 0.03 better score, show me one tenth the compute with a loss of 0.03 score. That would be impressive and far more useful.

i_am_proteus3y ago

While I am inclined to personally agree with your sentiment, I don't think I have better insights than Richard Sutton: http://incompleteideas.net/IncIdeas/BitterLesson.html

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."

sdenton43y ago

mjburgess3y ago

A modern "AI" models have c. 200bn parameters, say. At 32bit/param that's c. 6TB. At 6 bytes/word, 1T words, or more words than are in all books that have ever been written.

NNs, and models of this kind, are just search engines. They store a compression of of everything ever written, and prediction is just googling through it.

Der_Einzige3y ago

The deep connections between compression and prediction are not always obvious to those not in the field.

To illustrate just how much they are the same, here is (at one point SOTA) lossless text compression with GPT-2

https://bellard.org/libnc/gpt2tc.html

mjburgess3y ago

Well.. I think they're over-stated because of the current wave of AI basically only having naïve compression as its tool.

Is the concept `addition` a compression of the space `(Int, Int, Int)` ?

If you want to say it is, OK for some definition of compression. But that compression isnt "mere" in the modern AI sense, it's "exponentially dense".

In that my concept `addition` can generate arbitrarily large amounts of that decompressed space, which is infinite in size.

dredmorbius3y ago

https://teddit.net/r/MachineLearning/comments/uyratt/d_i_don...

(HN's rewrite rule for Reddit apparently doesn't catch unqualified domain references. Mods have been contacted.)

robertlagrant3y ago

I don't trust CERN-based studies either. Anyone who needs a large hadron collider is just showing off.

viraptor3y ago

robertlagrant3y ago

I don't think anyone's trying to only get that level of improvement. And the LHC is the definition of cranking up the power.

randomifcpfan3y ago

musicale3y ago

Experimental results become credible when they can be reproduced consistently.

Theoretical results become more credible when they are independently verified.

j7ake3y ago

It’s even worse in biology where some labs consistently publish in Nature, Science, Cell. Some of the papers are outright fraudulent. Don’t even trust the numbers.

At least for ML you can mostly reproduce the results, even in if they’re not that interesting.

dekhn3y ago

So in biology, they write papers that can't be reproduced becaue they're fraud, but in ML, they write papers that can't be reproduced because the setup is too fragile.

ta9883y ago

avinassh3y ago

Jeff Dean responded to OP:

(The paper mentioned by OP is https://arxiv.org/abs/2205.12755, and I am one of the two authors, along with Andrea Gesmundo, who did the bulk of the work).

You can see a couple of videos of the dynamic introduction of tasks and how the system responds here:

https://www.youtube.com/watch?v=THyc5lUC_-w

https://www.youtube.com/watch?v=2scExBaHweY

Edit: I should also point out that the code for the paper has been open-sourced at: https://github.com/google-research/google-research/tree/mast...

We will be releasing the checkpoint from the experiments described in the paper soon (just waiting on two people to flip approval bits, and process for this was started before the reddit post by OP).

---

source: https://old.reddit.com/r/MachineLearning/comments/uyratt/d_i...

j / k navigate · click thread line to collapse