GShard: Scaling giant models with conditional computation and automatic sharding (opens in new tab)

(arxiv.org)

112 pointsMrUssek5y ago35 comments

35 comments

"Quién es más macho?"

In a very short time, transformers have gone from under 1B, to 1.5B, to 3B, to 5B, to 175B, and now 600B parameters. 1T is only, what, like 67% more parameters, and therefore likely to be achieved in the short term. In fact, the authors of this paper tried 1T but ran into numerical issues that they will surely address soon. Not long after someone crosses 1T, expect 10T to become the next target. And why not? The best-funded AI research groups are in a friendly competition to build the biggest, baddest, meanest m-f-ing models the world has ever seen.

Scores continue to increase with diminishing returns, which is all fine and nice, but more importantly it seems we should expect to see machine-generated text getting much better from a qualitative standpoint -- that is, becoming less and less distinguishable from a lot of human output. That has been the trend so far.

We live in interesting times.

ma2rten5y ago

This is a sparse model. You can't just compare the parameter count against dense models like GPT-3.

Otherwise, Google already had a 137B parameter model in 2017: https://arxiv.org/abs/1701.06538

cs7025y ago

The model is sparsely-gated, not sparse. The individual experts in each mixture of experts are dense layers but they're sparsely activated, i.e., on each forward pass only some of them are conditionally used.

As to comparing parameter counts, I disagree with you. I think it's perfectly OK to compare parameter counts for different kinds of models. It would also be perfectly OK to compare, say, computational efficiency per parameter in each forward pass (which for this model is impressive), but that wasn't the focus on my comment above.

Finally, you're right that I didn't mention all the interim parameter counts that we have seen below 600B in all transformer variants. The list would have been way too long had I tried to include every figure!

gwern5y ago

Probably the most relevant comparison here would be a mix of wallclock-hours and FLOPS. The MoE may be inefficient on a parameter level, but it may be the most efficient way to convert FLOPS into model power (sort of like how you currently do better making models wider than deeper - experts are the ultimate 'width').

1 more reply

ma2rten5y ago

Fair enough, sparse usually means weights are sparse and not activations.

Obviously you can compare parameter count if you really want to, but from a technical point of view training a densely activated model is a much bigger feat. Also, I have personally spoken to one of the authors of this paper and they said sparsely activated models tend to well better on tasks that require knowledge but not tasks that require intelligence (e.g. GLUE).

1 more reply

IfOnlyYouKnew5y ago

Well if you can come up with a more elegant way forward, one that doesn’t require all that hardware and money and should, therefore, by definition be within the reach of the critics of “big” AI, I see no reason why it’s superior qualitative results wouldn’t be appreciated.

cs7025y ago

I'm not a critic! But I can see how my comment, meant to be a bit humorous, could be misinterpreted as being critical.

Personally, I think the friendly race to build bigger models is a great development. As I mentioned above, it seems to be leading to models that generate text/sequences that are qualitatively much better.

xbmcuser5y ago

Are they using this for google translate yet. As https://www.deepl.com/en/translator is better than google translate currently. Although for translating forums on a website etc I think netflix method would be better I hope google adopts it for its translate app https://arxiv.org/abs/2005.11197

cs7025y ago

Highly unlikely at the moment. But clearly that is the direction in which translation is going, so companies lacking the economies of scale that come with owning massive computational infrastructure will be at a serious disadvantage.

Veedrac5y ago

Google Translate's last major update was in May. They still use more restricted model sizes than the top-end research, but the techniques are making their way into the product.

https://ai.googleblog.com/2020/06/recent-advances-in-google-...

With a cursory analysis, it's not obvious whether DeepL is better than Google Translate any more.

dig6x5y ago

"...600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art."

It does appear that at the initial, resource intensive stages of tech like NLP big tech is primed to pave the way. We saw this happen across cloud, AI more generally, storage etc. But big tech then begins focusing on making the tech accessible to industry value chains (Azure, AWS, Amazon's AI services etc.). But as the industry matures there's more room for specialized startups/companies to enter the space to capture lucrative niches - thats exactly what Snowflake did for Cloud.

Definitely see this kind of scale as a step toward a more robust, mature industry if anything. Better it move forward than not.

saltking1125y ago

sounds interesting. can you elaborate? Not familiar with what Snowflake does or how it compares. Thanks

dig6x5y ago

Launched in 2014, its basically a purpose-built SQL cloud data warehouse solution. Its success pivoted among other factors, on its ability to abstract compute power and data storage to create a modular solution that could be made efficient for any data warehousing configuration.

In 2013 AWS augmented its core cloud offering with the introduction of Redshift, a ‘data warehousing as a service’ solution. The Redshift solution bundled compute and storage, reducing the ability to meet individual customer needs to scale either component separately in a cost efficient manner. Not having the option to unbundle compute and storage was inconsistent with the flexible nature that cloud had become known for.

Snowflake’s solution separated storage, compute, and services into separate layers, allowing them to scale independently and achieve greater cost efficiencies. By offering flexibility it was able to better address the requirements of a wider range of customers - who had previously been limited to the more restrictive bundled options, like Redshift.

ddorian435y ago

What is the cost when getting all data from network compared to local disks ? Is this connected to the cloud not offering local persistent disks ? Will this still work in colocation ?

saltking1125y ago

ic. but from Amazon's perspective, if customers want something that is mostly turn-key with the ability to customize, wouldn't they just combine AWS services themselves? I would believe Amazon has DB only solutions, compute only solutions like EC2 etc... So why was Snowflake able to thrive in this environment? Was the market simply too big?

2 more replies

texasbigdata5y ago

Similarly, not familiar with the unit economics when it's sitting on top of S3 / Azure storage. But it's growing really quick so it must be Something.

mensetmanusman5y ago

Awe inspiring thinking of the number of transistors working in orchestra to translate human language to english...

modeless5y ago

The most important advancements in machine learning for the next 10 years at least will be in hardware, and the software to take advantage of said hardware. You could even say that was already true starting with AlexNet, but it's even more obvious now with these enormous models.

We've barely scratched the surface of what's possible. Even if Moore's Law was dead (though it seems that TSMC may keep it alive for a bit longer) there are huge gains to be had when co-designing models and hardware. Stuff like https://www.cerebras.net/ is the direction I expect things to go.

dna_polymerase5y ago

Hardware will be a huge part, yes, but algorithmic advances would be even better. Utilizing existing commodity hardware to full extent is where the money is at. Specialized hardware will probably remain just that, specialized and mostly too expensive.

modeless5y ago

There will be algorithmic advances, but they will benefit larger models too. Larger models will still win. The value of having the best AI is so great that it will be worth nearly any level of investment in hardware to the large tech companies that can afford it.

Der_Einzige5y ago

Yet another paper with results that basically look like this: https://d3b8hk1o42ev08.cloudfront.net/wp-content/uploads/201...

Still impressive, don't get me wrong, but I am starting to believe that NLP will be dominated increasingly by the big players since they are the only ones who can train a 1 TRILLION parameter model (they show that in the paper). I can't even do inference with a 36 layer, 2048 neuron per layer network with my GTX 2080ti. Sad....

rahimnathwani5y ago

"I can't even do inference with a 36 layer, 2048 neuron per layer network with my GTX 2080ti."

Not even for a single instance? Your GPU has 11GB of RAM. Why isn't 14k per neuron enough? Is the input really large, or does each neuron have very high precision?

MrUssekOP5y ago

There's an extremely large number of parameters per "neuron". The 600B parameters will take up more than 1TB of space in memory, far too much for the 2080 TI or even main memory for most systems.

rahimnathwani5y ago

I'm not talking about inference on a 600B parameter model. GP said they can't do inference on a 32-layer, 2048 neurons-per-layer network. Let's assume every layer is fully connected. So each neuron will have 2048 parameters. So that's 32 * 2048 * 2048 parameters. That's 132MM parameters in 11GB of RAM, or 82 bytes per parameter. If each parameter is 4 bytes (that seems like a lot of precision), plus 4 bytes per calculated value, you're still only using 10% of the GPU's RAM. You should be able to do inference on a batch of 16-20 examples at a time.

What have I missed?

2 more replies

fhssn15y ago

Stay tuned for algorithmic advancements.

teruakohatu5y ago

The brain has ~100+ trillion synapses [1] (There seems to be estimates from 100-1000 T).

A 1 trillion parameter model should not be far off, which is about the same number of synapses as house mice.

We will be around 1% of the way to human brain complexity (Well, probably not but it is fun to think of it).

[1] https://en.wikipedia.org/wiki/List_of_animals_by_number_of_n...

visarga5y ago

You can't directly compare biological and artificial neurons like that. Biological ones have synapses that function in a much more complex way than weights in a neural net, but are also much slower and noisy.

On the other hand, we don't have a robot body to house the model in. Without embodiment it won't be able to learn to interact with the world like us.

Thirdly, in humans, specific priors have been baked in the brain by evolution (data symmetries and efficiencies). We don't know all of them yet and how to replicate. We do rely on translation invariance for images and time shift invariance for sequences, and permutation invariance for some set and graph neural nets, but they are not all the priors the brain makes use of.

Veedrac5y ago

Biological neurons are complex networks of thousands of synapses, and it's definitely reasonably to say a biological neuron is not 1:1 comparable to an artificial NN neuron. Biological neurons can compute XOR[1] and some even contain loops, called autapses.

However it seems fairly reasonable to say a synapse is roughly 1:1 comparable to a network parameter, in that they seem to be doing about the same sort of weighted propagation with about the same computational power. A synapse does work very differently, and has a couple of very low bandwidth side-channels, but its main job is the same job as a network weight.

[1] https://science.sciencemag.org/content/367/6473/83

enchiridion5y ago

Do we really rely on translation invariance significantly? It seems more like we scan a very focused area quickly

justicezyx5y ago

Note that this is a system paper, not a ML/DL/NLP paper. It's kind of OK to expand the parameter to such larger number.

j / k navigate · click thread line to collapse

35 comments

cs7025y ago

"Quién es más macho?"

We live in interesting times.

ma2rten5y ago

This is a sparse model. You can't just compare the parameter count against dense models like GPT-3.

Otherwise, Google already had a 137B parameter model in 2017: https://arxiv.org/abs/1701.06538

cs7025y ago

gwern5y ago

1 more reply

ma2rten5y ago

Fair enough, sparse usually means weights are sparse and not activations.

1 more reply

IfOnlyYouKnew5y ago

cs7025y ago

I'm not a critic! But I can see how my comment, meant to be a bit humorous, could be misinterpreted as being critical.

xbmcuser5y ago

cs7025y ago

Veedrac5y ago

Google Translate's last major update was in May. They still use more restricted model sizes than the top-end research, but the techniques are making their way into the product.

https://ai.googleblog.com/2020/06/recent-advances-in-google-...

With a cursory analysis, it's not obvious whether DeepL is better than Google Translate any more.

dig6x5y ago

Definitely see this kind of scale as a step toward a more robust, mature industry if anything. Better it move forward than not.

saltking1125y ago

sounds interesting. can you elaborate? Not familiar with what Snowflake does or how it compares. Thanks

dig6x5y ago

ddorian435y ago

What is the cost when getting all data from network compared to local disks ? Is this connected to the cloud not offering local persistent disks ? Will this still work in colocation ?

saltking1125y ago

2 more replies

texasbigdata5y ago

Similarly, not familiar with the unit economics when it's sitting on top of S3 / Azure storage. But it's growing really quick so it must be Something.

mensetmanusman5y ago

Awe inspiring thinking of the number of transistors working in orchestra to translate human language to english...

modeless5y ago

dna_polymerase5y ago

modeless5y ago

Der_Einzige5y ago

Yet another paper with results that basically look like this: https://d3b8hk1o42ev08.cloudfront.net/wp-content/uploads/201...

rahimnathwani5y ago

"I can't even do inference with a 36 layer, 2048 neuron per layer network with my GTX 2080ti."

Not even for a single instance? Your GPU has 11GB of RAM. Why isn't 14k per neuron enough? Is the input really large, or does each neuron have very high precision?

MrUssekOP5y ago

There's an extremely large number of parameters per "neuron". The 600B parameters will take up more than 1TB of space in memory, far too much for the 2080 TI or even main memory for most systems.

rahimnathwani5y ago

What have I missed?

2 more replies

fhssn15y ago

Stay tuned for algorithmic advancements.

teruakohatu5y ago

The brain has ~100+ trillion synapses [1] (There seems to be estimates from 100-1000 T).

A 1 trillion parameter model should not be far off, which is about the same number of synapses as house mice.

We will be around 1% of the way to human brain complexity (Well, probably not but it is fun to think of it).

[1] https://en.wikipedia.org/wiki/List_of_animals_by_number_of_n...

visarga5y ago

On the other hand, we don't have a robot body to house the model in. Without embodiment it won't be able to learn to interact with the world like us.

Veedrac5y ago

[1] https://science.sciencemag.org/content/367/6473/83

enchiridion5y ago

Do we really rely on translation invariance significantly? It seems more like we scan a very focused area quickly

justicezyx5y ago

Note that this is a system paper, not a ML/DL/NLP paper. It's kind of OK to expand the parameter to such larger number.

j / k navigate · click thread line to collapse