Google Cloud TPU Multislice Training (opens in new tab)

(cloud.google.com)

109 pointsinfixed2y ago48 comments

48 comments

xnx2y ago

Full title: "Google Cloud demonstrates the world’s largest distributed training job for large language models across 50000+ TPU v5e chips"

Summary from Bard: "This article is about training large language models (LLMs) on Google Cloud TPUs. It discusses the challenges of training LLMs at scale, and how Google Cloud TPU Multislice Training addresses these challenges. The article also details the results of a recent experiment in which Google trained a 128B parameter LLM on 50,944 TPU v5e chips. This experiment is the largest publicly disclosed LLM distributed training job to date."

jrk2y ago

As far as I can tell, the article notably never defines what "slices" are or what "multi-slice" means.

rwitten2y ago

Great questions! Slices are a set of TPU chips that share a fast, private inter-chip-interconnect. Unlike the current GPU generation in clouds, the TPUs on different machines can communicate through this private network. Multislice means that we're using a hierarchical network, where there is both inter-chip-interconnect and normal data-center netowrking.

More details: https://cloud.google.com/tpu/docs/multislice-introduction

(P.S. - contributor on blog post, Google employee, all thoughts my own)

smarterclayton2y ago

Also, I should point out that a set of machines hosting TPUs is referred to as a "pod", which is not the same thing as a Kubernetes pod (also referenced in this doc).

The term "pod" originated in early data center design and occasionally crosses over from HPC to broad use - i.e. nVidia calls the set of DGX machines a "pod" https://blogs.nvidia.com/blog/2021/03/05/what-is-a-cluster-p....

Kubernetes chose "pod" to represent a set of co-scheduled containers, like a "pod of whales". Other systems like Mesos and Google's Borg https://storage.googleapis.com/pub-tools-public-publication-... use "task" to refer to a single container but didn't have a concept for heterogenous co-scheduled tasks at the time.

Somewhat ironically, it now means TPUs on GKE are confusing because we have TPUs hosts organized into "pods", and "pods" for the software using the TPUs.

A Kubernetes pod using a TPU lands on a host which is part of a slice of a TPU pod.

jeffbee2y ago

As your second link mentions in section 2.4, Borg has "allocs" which are basically pods.

1 more reply

leumassuehtam2y ago

Did they use this to train Gemini? Which raises the question, where is Gemini?

lern_too_spel2y ago

Unlikely. One reason Google Cloud is so terrible is that nobody in Google actually uses Google Cloud. It used to be that every time I mentioned this, somebody would jump in and say, "Well actually, Google Domains runs on Google Cloud," and we'd discuss whether Google Domains was a business critical part of Google. https://support.google.com/domains/answer/13689670?hl=en

amf122y ago

> Unlikely. One reason Google Cloud is so terrible is that nobody in Google actually uses Google Cloud.

Well, actually, Google Cloud is just an abstraction on top of internal Google infra, so this isn't the right question. So, it depends on what you want to infer/compare.

lern_too_spel2y ago

> Well, actually, Google Cloud is just an abstraction on top of internal Google infra

I didn't say otherwise. Of course Google Cloud runs on internal Google infrastructure. They wouldn't have an entirely different stack to build Google Cloud on. The problem is that Googlers don't use Google Cloud.

Amazonians use AWS. https://courses.cs.washington.edu/courses/cse452/23wi/papers...

Microsofties use Azure. https://www.zdnet.com/article/microsoft-moves-closer-to-runn...

danielmarkbruce2y ago

It is the right question. It's the right question because Google doesn't dogfood Google Cloud like they should/could. Dogfooding a bunch of stuff at a lower level of abstraction isn't the same thing.

tdullien2y ago

Xoogler here. GCP was not an abstraction on Borg when I was there. GKE isn't either.

So up until late 2018 when I left, very little of Cloud ran on "proper" Google3 infra. This may have shifted slightly (cloud has been fishing for good Google infra to externalize a lot), but in general cloud!=google3 infra.

milesward2y ago

Lots of stuff inside Alphabet runs on GCP. And yes, Google Domains pisses me off.

lern_too_spel2y ago

If anything important ran on Google Cloud, you can bet we'd see a blog post from Google Cloud marketing about that. Yes, many of the money losing side bets from the non-Google companies under the Alphabet umbrella use Google Cloud. That's only because they want the optionality to spin them off if by some miracle any of them are ever worth anything. If they were part of Google, they would use internal infrastructure. If they weren't under Alphabet, they would use AWS or Azure like everyone else.

antifa2y ago

They also announced Claude 2 back in August, but AFAICT there's currently no way to get API access to Claude 2.

DavidSJ2y ago

Question for rwitten or anyone else involved in this project:

I see a per-device batch size of 6 for the 16B model. With 256x199 = 50944 TPUs and a sequence length of 2048, this works out to 104M tokens per batch. This is much larger than typical for training runs of dense LMs of this size, which are usually closer to ~4M tokens per batch.

Was your critical batch size really this large? In other words, did you really see a benefit as compared to a much smaller batch size (and probably many fewer TPUs)? Did you use some special learning rate schedule or optimizer to achieve this?

DavidSJ2y ago

Oops, I dropped the factor of 6, so it’s actually 626M tokens (306k sequences), which is about 150x what’s typical.

p1esk2y ago

I’m confused - why do you multiply number of chips by sequence length? Shouldn’t the total batch size be 50k x6?

sillysaurusx2y ago

Token count. But tokens per batch is a bad metric. I learned this the hard way through experience.

It turns out that what matters is step count. Increasing batch size makes the model train faster, but increasing seq length from 1024 to 2048 doesn’t make it train twice as fast. So saying 104M tokens rather than batch size 50k x 6 is misleading to yourself. (One of the most surprising aspects of learning ML was how easy it is for me to trick myself in various silly ways like that.)

The mental model to make this easy to remember: progress happens in discrete quantities called steps. Training on 104M tokens per step means it’s embedding 104M tokens of knowledge every step. This isn’t the same thing as requiring an special case optimizer due to large batch sizes — sequence length adds "knowledge bandwidth", but otherwise doesn’t mess with the training dynamics.

As far as large batch optimizers, there’s LARS, which google used for their MLPerf results. I imagine they stuck with that. It creates a per-layer confidence metric, so that when the massive batch size makes a massive change, it dampens the change to smooth out the effect across the network. And since it’s a multiply, the shape of the gradient (by "shape" I mean in 3D space, where the Z axis is the intensity of the gradient) remains the same, so it doesn’t harm any knowledge transfer. It’s purely a stabilization aid.

Kind of weird I remember that after three years.

DavidSJ2y ago

Whether you count tokens or sequences, it’s about 25x the usual batch size. My guess is it makes for a fancy benchmark but isn’t actually useful. Would be interested in being proven otherwise.

3 more replies

DavidSJ2y ago

You can measure it either way, and you’ll see it both ways in the literature. In this case it doesn’t matter much how you measure since 2048 is a typical pretraining sequence length. 300k sequences per batch is huge compared to typical batch sizes in the literature, which are closer to 2048, for about 4M tokens total.

sashank_15092y ago

Ok so they claim in the article, 50000 TPU’s is equivalent to 10 exaflop floating point computations. That is equivalent to ~2,512 NVIDIA H100’s, which is like really small. Just shows the difference between TPU’s and GPU’s I guess. Inflection, a new LLM company created a 20,000 H100 cluster, I’m positive OpenAI, Tesla, Meta etc have orchestrated a job on more than 2500 H100 GPU’s.

rwitten2y ago

Hey! I'm an contributor on this (Rafi Witten), all opinions my own.

You're asking the right question but I think the math is off by a bit. The equivalent number on the H100's is 989 TFLOP/s/chip so the equivalent job is ~10K H100's = (10 * 10^18) / (989 * 10^12). (Both chips also have 8-bit acceleration!)

I believe this is the largest ML job both by exaflops and number of chips every demonstrated. Other companies own more chips or exaflops than we show in this job but getting all the hardware working at once on a single job is a different matter! :-)

sashank_15092y ago

I think your math is also slightly off, in the Google article, it claims “that is capable of achieving 10 exa-FLOPs (16-bit).” , so you should be comparing with 16 bit operations from a H100.

989 is TF32 core, for 16 bit it is 1979, so I guess around 5000 H100’s in a single training job would be equivalent to the training job mentioned in this article.

Either way I actually would not be surprised if OpenAI has launched a single job on more than 10k GPU’s, but I also am not very knowledgeable on practical scaling. Congrats on the feat!

aschleck2y ago

1979 16 bit flops on an H100 is with sparsity. See footnote 2 on https://www.nvidia.com/en-us/data-center/h100/. You should be halving it for non-sparse flops.

1 more reply

latchkey2y ago

I'd love to hear more about the challenges of getting the hardware working.

aschleck2y ago

It's worth noting that just because an H100 has a higher flops number doesn't mean your program is actually hitting that number of flops. Modern TPUs are surprisingly competitive with Nvidia on a perf/$ metric, if you're doing cloud ML they are absolutely worth a look. We have been keeping costs down by racking our own GPUs but TPUs are so cost effective that we need to do some thinking about changing our approach.

I'm not certain but I think part of this is that XLA (for example) is a mountain of chip-specific optimizations between your code and the actual operations. So comparing your throughput between GPU and TPU is not just flops-to-flops.

lern_too_spel2y ago

This is a blog post from Google Cloud marketing. It's saying that you, too, could train an LLM on Google Cloud if you hand them enough money. You can't do that on Inflection's or Tesla's clusters. Similar marketing blog post from last year: https://cloud.google.com/blog/products/compute/calculating-1...

The PaLM paper linked in the blog post is about how to get something actually useful out of that compute.

latchkey2y ago

It sounds like they partnered with CoreWeave to use their equipment and it is "only" 3500 gpus so far.

https://inflection.ai/inflection-ai-announces-1-3-billion-of...

https://inflection.ai/nvidia-coreweave-mlperf

abatilo2y ago

https://youtu.be/z3hmfSVmyqg?si=eLPZ0D6ug3D6PreI

As of 2 months ago, they had at least 7000 up and running, fwiw

latchkey2y ago

The meat starts here:

https://youtu.be/z3hmfSVmyqg?feature=shared&t=3328

I'm curious why it is so hard for them to deploy the compute. They seem to be fairly behind schedule.

marmaduke2y ago

> That is equivalent

On what number or op for the h100?

jeffbee2y ago

Something that doesn't seem worth bragging about is that the startup time increases linearly with the cluster size. Wouldn't you want it to be constant? What's the issue there?

smarterclayton2y ago

Disclaimer: work associated with this team, didn't write or review the blog post

Article stated that it was throughput scheduling the pods on the clusters (from unrelated benchmarks that's usually ~300 pods/sec throughput for kube scheduler today) and then doing XLA compilation at pod launch, rather than amortizing once for all jobs.

Optimizing throughput of kube scheduler is a good general opportunity and something I believe we would like to see.

I believe AOT compilation just not a critical optimization for the test, we would recommend it when running large and long training jobs to AOT compile to keep pod start latency low for hardware failures and job restarts (from checkpoints).

> The start times we observed were impressive, but we believe we can improve these even further. We are working on areas such as optimizing scheduling in GKE to increase throughput and enabling ahead-of-time compilation in MaxText to avoid just-in-time compilations on the full cluster.

jeffbee2y ago

Thanks for the context. I remember recently reading a paper from I think Baidu where they claimed to have a container arrival rate in the millions per second, consequently it was practical to operate their whole site in the style of lambda/cloud functions.

Actually now that I am searching for that it seems Baidu has a number of papers on workload orchestration at scale specifically for learning.

smarterclayton2y ago

I will note that a trend I have observed with recent ML - as we increasingly use accelerators and models correspondingly grow in size, we are returning to a "one machine, one workload" paradigm for the biggest training and inference jobs. You might have 8k accelerators, but only 1000 machines, and if you have one container per host 300 schedules / second is fast.

While at the same time as you note we have functional models for container execution that are approaching millions of dispatches for highly partitionable work, especially in data engineering and ETL.

rwitten2y ago

(Contributor on the blog post, all opinions my own)

Agreed with you and we definitely weren't trying to brag! This is fast compared to people's expectations in the space but slow compared to what we should be able to accomplish and will accomplish in the future.

behnamoh2y ago

Can someone ELI5 this?

jeffbee2y ago

Google has a megawatt-scale TPUv5 cluster that they can spare for stunts, and they are flexing it.

filterfiber2y ago

tl;dr - google has big computer (a lot of TPUs)

they just showed they could indeed make 50k TPUs do some flops.

With no paper this is just a marketing press release - the only takeaway is that existing tech stacks can utilize it probably.

j / k navigate · click thread line to collapse

48 comments

xnx2y ago

Full title: "Google Cloud demonstrates the world’s largest distributed training job for large language models across 50000+ TPU v5e chips"

jrk2y ago

As far as I can tell, the article notably never defines what "slices" are or what "multi-slice" means.

rwitten2y ago

More details: https://cloud.google.com/tpu/docs/multislice-introduction

(P.S. - contributor on blog post, Google employee, all thoughts my own)

smarterclayton2y ago

Also, I should point out that a set of machines hosting TPUs is referred to as a "pod", which is not the same thing as a Kubernetes pod (also referenced in this doc).

Somewhat ironically, it now means TPUs on GKE are confusing because we have TPUs hosts organized into "pods", and "pods" for the software using the TPUs.

A Kubernetes pod using a TPU lands on a host which is part of a slice of a TPU pod.

jeffbee2y ago

As your second link mentions in section 2.4, Borg has "allocs" which are basically pods.

1 more reply

leumassuehtam2y ago

Did they use this to train Gemini? Which raises the question, where is Gemini?

lern_too_spel2y ago

amf122y ago

> Unlikely. One reason Google Cloud is so terrible is that nobody in Google actually uses Google Cloud.

Well, actually, Google Cloud is just an abstraction on top of internal Google infra, so this isn't the right question. So, it depends on what you want to infer/compare.

lern_too_spel2y ago

> Well, actually, Google Cloud is just an abstraction on top of internal Google infra

Amazonians use AWS. https://courses.cs.washington.edu/courses/cse452/23wi/papers...

Microsofties use Azure. https://www.zdnet.com/article/microsoft-moves-closer-to-runn...

danielmarkbruce2y ago

It is the right question. It's the right question because Google doesn't dogfood Google Cloud like they should/could. Dogfooding a bunch of stuff at a lower level of abstraction isn't the same thing.

tdullien2y ago

Xoogler here. GCP was not an abstraction on Borg when I was there. GKE isn't either.

milesward2y ago

Lots of stuff inside Alphabet runs on GCP. And yes, Google Domains pisses me off.

lern_too_spel2y ago

antifa2y ago

They also announced Claude 2 back in August, but AFAICT there's currently no way to get API access to Claude 2.

DavidSJ2y ago

Question for rwitten or anyone else involved in this project:

DavidSJ2y ago

Oops, I dropped the factor of 6, so it’s actually 626M tokens (306k sequences), which is about 150x what’s typical.

p1esk2y ago

I’m confused - why do you multiply number of chips by sequence length? Shouldn’t the total batch size be 50k x6?

sillysaurusx2y ago

Token count. But tokens per batch is a bad metric. I learned this the hard way through experience.

Kind of weird I remember that after three years.

DavidSJ2y ago

Whether you count tokens or sequences, it’s about 25x the usual batch size. My guess is it makes for a fancy benchmark but isn’t actually useful. Would be interested in being proven otherwise.

3 more replies

DavidSJ2y ago

sashank_15092y ago

rwitten2y ago

Hey! I'm an contributor on this (Rafi Witten), all opinions my own.

sashank_15092y ago

I think your math is also slightly off, in the Google article, it claims “that is capable of achieving 10 exa-FLOPs (16-bit).” , so you should be comparing with 16 bit operations from a H100.

989 is TF32 core, for 16 bit it is 1979, so I guess around 5000 H100’s in a single training job would be equivalent to the training job mentioned in this article.

Either way I actually would not be surprised if OpenAI has launched a single job on more than 10k GPU’s, but I also am not very knowledgeable on practical scaling. Congrats on the feat!

aschleck2y ago

1979 16 bit flops on an H100 is with sparsity. See footnote 2 on https://www.nvidia.com/en-us/data-center/h100/. You should be halving it for non-sparse flops.

1 more reply

latchkey2y ago

I'd love to hear more about the challenges of getting the hardware working.

aschleck2y ago

lern_too_spel2y ago

The PaLM paper linked in the blog post is about how to get something actually useful out of that compute.

latchkey2y ago

It sounds like they partnered with CoreWeave to use their equipment and it is "only" 3500 gpus so far.

https://inflection.ai/inflection-ai-announces-1-3-billion-of...

https://inflection.ai/nvidia-coreweave-mlperf

abatilo2y ago

https://youtu.be/z3hmfSVmyqg?si=eLPZ0D6ug3D6PreI

As of 2 months ago, they had at least 7000 up and running, fwiw

latchkey2y ago

The meat starts here:

https://youtu.be/z3hmfSVmyqg?feature=shared&t=3328

I'm curious why it is so hard for them to deploy the compute. They seem to be fairly behind schedule.

marmaduke2y ago

> That is equivalent

On what number or op for the h100?

jeffbee2y ago

Something that doesn't seem worth bragging about is that the startup time increases linearly with the cluster size. Wouldn't you want it to be constant? What's the issue there?

smarterclayton2y ago

Disclaimer: work associated with this team, didn't write or review the blog post

Optimizing throughput of kube scheduler is a good general opportunity and something I believe we would like to see.

jeffbee2y ago

Actually now that I am searching for that it seems Baidu has a number of papers on workload orchestration at scale specifically for learning.

smarterclayton2y ago

While at the same time as you note we have functional models for container execution that are approaching millions of dispatches for highly partitionable work, especially in data engineering and ETL.

rwitten2y ago

(Contributor on the blog post, all opinions my own)

behnamoh2y ago

Can someone ELI5 this?

jeffbee2y ago

Google has a megawatt-scale TPUv5 cluster that they can spare for stunts, and they are flexing it.

filterfiber2y ago

tl;dr - google has big computer (a lot of TPUs)

they just showed they could indeed make 50k TPUs do some flops.

With no paper this is just a marketing press release - the only takeaway is that existing tech stacks can utilize it probably.

j / k navigate · click thread line to collapse