While not perfect, I want to commend the RiseML folks for doing not only a “just out of the box” run in both regular and fp16 mode (for the V100), but also adding their own LSTM experiment to the mix. We need third-party benchmarks whenever new hardware or software is being sold by vendors (reminder: I benefit from you buying Google Cloud!).
I hope the authors are able to collect some of the feedback here and update their benchmark and blog post. The question about batch size comparisons is probably the most direct, but like others, I’d encourage a run on 1, 2, 4 and 8 V100s as well.
Thanks for your feedback and your suggestions (and everybody else's)! We'll make sure to gather all of the valuable feedback and run additional experiments. Different batch sizes and a comparison against more than one GPU are already planned (and partly executed).
It makes any benchmarks become Google-cloud benchmarks, right?
Edit: I am complaining a bit about the lack of availability, but there's also a real point here. If there's no source for TPUs outside of Google, Google Cloud competes only with other cloud providers and with owning physical GPUs. Long term, it has no incentive to be anything more than a little bit more efficient than these, however much its cost of producing TPUs declines.
(I'm saying this with my CMU hat, not my Google hat.)
For a while, it's very likely that Google will be the main user of these, so there's still plenty of incentive for it to increase efficiency and reduce costs.
I assume there's a high capital cost for this new hardware, but when they scale it up, I wonder if the ratio of TPU cost to GPU cost will trend towards the ratio of performance-per-watt between the platforms? Seems like a natural limit, even if it never quite gets there.
[0] https://cloud.google.com/blog/big-data/2017/05/an-in-depth-l...
But you aren't paying for the electricity, you're paying for processing, which is an unconnected parameter. They only "sell" these chips per use, not on the open market
Presuming power is a major cost input (which I assume), their profit/op is much higher. So they could sell for less than an equivalent GPU and make more money. But they think they can get away with value pricing it (processing more per unit time is presumably worth it to many customers), and more power to them.
(that last bit was not an intentional pun; only noticed after typing it)
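As a toy illustration of that margin argument (all prices and wattages below are made up purely for illustration, not actual Google or NVIDIA figures):

```python
# Hypothetical numbers only -- the point is the shape of the argument,
# not the specific values.
def profit_per_hour(price_per_hour, watts, elec_cost_per_kwh):
    """Profit per device-hour if electricity were the only marginal cost."""
    power_cost = watts / 1000.0 * elec_cost_per_kwh  # kW * $/kWh
    return price_per_hour - power_cost

# Assumption: the TPU draws less power for the same rental price.
gpu = profit_per_hour(price_per_hour=2.50, watts=300, elec_cost_per_kwh=0.08)
tpu = profit_per_hour(price_per_hour=2.50, watts=200, elec_cost_per_kwh=0.08)
assert tpu > gpu  # lower power draw -> higher margin at the same price
```

Under those assumptions, the lower-power part earns more per hour at identical pricing, which is the room to undercut an equivalent GPU and still come out ahead.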
Depends on your metric, Jonathan! If you focus on the per dollar numbers, then it’s actually net favorable to the V100, because a second GPU over NVLINK won’t be as cost-efficient. If what you care about is raw throughput “in a single box”, then 8xV100 probably comes out ahead here.
Like someone else below though, I worry about the “hey wait a minute, changing the batch size just for the TPU seems unfair” and the whole “the LSTM didn’t converge” bit. Not a bad first draft, but hopefully the authors can do some more comparisons.
Thanks for your feedback. As I noted above, we will report further results with larger batch sizes (and smaller ones for the TPU). The LSTM not converging is one of the experiences we wanted to share. We are working on solving this issue and will update the post accordingly. Our goal is really a fair and valuable comparison, which is not easy, so we value all of the feedback.
(I'm obviously biased - I helped with parts of the cloud-side of cloud TPU - but I presume this comment stands on its own. :-)
Also, while I certainly agree that the performance/dollar comparison is highly relevant to customers at a given instant, that may only tell us that Google is subsidizing this hardware now that they've deployed it, and/or that, lacking serious competition, NVIDIA has been building crazy margins into their P100/V100 prices. In understanding fundamental technological tradeoffs, and even the limits of what the pricing in a more competitive market could be, it is relevant to compare performance per unit of hardware resources (mm^2, die/package, watt, GB of HBM, etc.)
In short, these comparisons are hard, and no single one tells a complete story. I pushed back because, while the post includes some nuance, it brushes a great deal under the rug and focuses primarily on a problematic comparison.
(Further disclosure: I'm at least the third person in this sub-thread with some Google Brain/Cloud affiliation. I am speaking in my independent academic voice. I also think TPUs are great, having them publicly available now is great, and competition and diversity of architectural approach in accelerators is great. I appreciate the effort of the authors, but think the subtlety of these comparisons requires serious care.)
It's all about performance per dollar.
Not necessarily. The DGX-1, for example, has pretty poor perf/$$ but reduces the time a data scientist spends waiting. For some organizations, their people time is so valuable that what matters is “what gets me my answers back faster”, because that employee is easily $100/hr+.
That’s actually why the 8xV100 with NVLINK is so attractive (and why the TPUs also have board to board networking, not just chip to chip).
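A back-of-the-envelope version of that tradeoff, with made-up numbers (the $100/hr figure is from the comment above; the hardware rates are illustrative):

```python
# Toy calculation: when does pricier-but-faster hardware win overall?
def total_cost(hw_cost_per_hour, hours_to_result, engineer_cost_per_hour=100.0):
    # Count the engineer's time spent waiting as a cost alongside the hardware.
    return (hw_cost_per_hour + engineer_cost_per_hour) * hours_to_result

cheap = total_cost(hw_cost_per_hour=3.0, hours_to_result=10)   # slower box
fast = total_cost(hw_cost_per_hour=25.0, hours_to_result=2)    # 8xV100-class box
assert fast < cheap  # worse perf/$ on paper, cheaper once people time is included
```

The faster box loses badly on raw perf/$ but wins once the person waiting on it is priced in.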
I'd like to know perf/watt, for instance, even if it doesn't matter to the customer.
Point well taken, we'll make sure to add a comparison to 4 and 8 GPUs. For now, a "Cloud TPU" (containing 8 cores) seems to be the smallest unit to allocate. The question of what exactly makes up a single device and how many to compare against each other is not easy to answer.
Impressive results regardless though; quite a bit faster relative to the V100 than the paper specs would suggest.
Good point, I agree that the FP16 GPU results should be closer or grouped with the TPU results. We'll try to update accordingly.
Note that the TPU supports larger batch sizes because it has more RAM. We tested multiple batch sizes for GPUs and reported the fastest one. We'll try increasing the batch sizes as far as possible and report. The overall comparison will likely not change by much - we saw speed increases of around 5% doubling the batch size from 64 to 128. (https://www.tensorflow.org/performance/benchmarks also reports numbers for batch sizes of 32 and 64 on the P100)
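For what it's worth, the "test several batch sizes and report the fastest" approach can be sketched like this (a NumPy matmul stands in for a real training step; all names and sizes are illustrative, not the actual benchmark code):

```python
import time
import numpy as np

def images_per_sec(batch_size, dim=512, steps=20):
    """Throughput of a toy 'training step' (one matmul) at a given batch size."""
    x = np.random.rand(batch_size, dim).astype(np.float32)
    w = np.random.rand(dim, dim).astype(np.float32)
    start = time.perf_counter()
    for _ in range(steps):
        _ = x @ w  # stand-in for one forward/backward pass
    elapsed = time.perf_counter() - start
    return batch_size * steps / elapsed

# Sweep candidate batch sizes and report the fastest for this device.
results = {bs: images_per_sec(bs) for bs in (32, 64, 128)}
fastest = max(results, key=results.get)
```

The per-device sweep is what makes "optimal batch size per part" a defensible methodology, as discussed above.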
Oh! You should definitely say that. It's semi-reasonable then to choose the batch size that is optimal for the part. It'd be good to make sure this isn't why your LSTM didn't converge though...
They are claiming a 29x improvement in that area.
Link to NVIDIA documentation on mixed-precision TensorCores: https://devblogs.nvidia.com/inside-volta/
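For anyone curious, the core mixed-precision idea behind TensorCores (fp16 multiplies, fp32 accumulation) can be illustrated in a few lines of NumPy; this is a conceptual sketch, not how the hardware is programmed:

```python
import numpy as np

np.random.seed(0)
# Inputs quantized to fp16, as they would be for a TensorCore matmul.
a = np.random.rand(64, 64).astype(np.float16)
b = np.random.rand(64, 64).astype(np.float16)

# fp16 inputs, fp32 accumulation -- what the hardware does internally.
mixed = np.matmul(a.astype(np.float32), b.astype(np.float32))

# Keeping everything in fp16: accumulation error grows with the inner dimension.
pure_fp16 = np.matmul(a, b).astype(np.float32)

# High-precision reference over the same fp16-quantized inputs.
reference = np.matmul(a.astype(np.float64), b.astype(np.float64))
assert np.abs(mixed - reference).max() < np.abs(pure_fp16 - reference).max()
```

The fp32 accumulator is why mixed precision can roughly double throughput without giving up the accuracy of a pure-fp16 pipeline.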
TPUv2 is specially optimized for deep learning.
Nvidia's Volta microarchitecture is a graphics processor with additional tensor units. It's a general-purpose (GPGPU) chip designed with graphics and other scientific computing tasks in mind. Nvidia has enjoyed monopoly power in the market, and a single microarchitecture has been enough in every high-performance category.
The next logical step for Nvidia is to develop a specialized deep learning chip to compete with TPUv2 and others.
I don't know, this benchmark seems to show V100 doing pretty well against a specialized ASIC. It may well be that all NVIDIA has to do is cut costs on V100 to make a two V100s about as expensive as the cloud TPUv2. With increased batch size, it looks like two V100s would have performance comparable to TPUv2.
The microarchitecture has many unnecessary things, and it's not optimized as a whole for deep learning.
I'm also not sure how we can take Google's word for the numbers, since they might as well be eating a less-than-ideal power cost to promote their platform. Any upfront cost will probably be offset by locked-in customers later on.
I might just be a bit cynical though.
If that is right, is the "TensorFlow-optimized" ResNet-50 using 16-bit floats when running on TPUv2?
No, this is really only comparing the throughput on the devices. A thorough comparison should really focus on time to reach a certain quality - including all of the tricks available for a certain architecture.
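A minimal sketch of what a time-to-quality harness might look like (the "training" here is a fake stand-in whose accuracy climbs 1% per step; function and variable names are hypothetical):

```python
import time

def time_to_quality(train_step, evaluate, target, max_steps=10000):
    """Seconds and steps until evaluate() first reaches target (None if never)."""
    start = time.perf_counter()
    for step in range(1, max_steps + 1):
        train_step()
        if evaluate() >= target:
            return time.perf_counter() - start, step
    return None, max_steps

# Fake model state: accuracy rises by 0.01 per training step.
state = {"n": 0}
elapsed, steps = time_to_quality(
    train_step=lambda: state.update(n=state["n"] + 1),
    evaluate=lambda: state["n"] / 100,
    target=0.75,
)
assert steps == 75
```

Measuring wall-clock time to a target metric, rather than raw steps/sec, is what lets per-architecture tricks (batch size, precision, learning-rate schedules) be compared fairly.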
These are only available on the Google Cloud right now. I don't think there are plans to sell them anytime soon.
Does this mean it's inference-only? (I only quickly scanned the article)
I wonder which is higher, the cost for creating the TPUs in terms of engineering and manufacturing or the cost differential in terms of usage as compared to NVIDIA's latest?
I worry about Google long term here. I am surprised the TPU doesn't kick the ass of the NVIDIA chips.
By the logic above, you would conclude that TPUv1 (the inference-only chip) might have been a mistake, but we’ve been very public about how it “saved us from building lots of datacenters”.
That wasn’t ever sold as part of Cloud, so the benefit there is all from the second bit you mentioned: cheaper and more efficient than GPUs at the time. The paper also goes into more detail, but the size of that initial engineering team and time to market were both quite small.
For training, before Volta (and kind of Pascal), GPUs were the best option but not particularly efficient. Volta does the same “we should have a single instruction that does lots of math in one shot” by cleverly reusing the existing functional units. That the V100 is a great chip is a good outcome for the whole industry. But GPUs aren’t (and shouldn’t be) just focused on ML. My bet is that there’s still a decent amount of runway left in specialized chips for ML, just as GPUs carved out their own niche versus CPUs.
But again, the “even just for Google” benefit is really enormous so I wouldn’t assume that Cloud has to pay for the entire effort. Could GPU manufacturers improve the cost:performance ratio of ML workloads enough that Google doesn’t have to build TPUs anymore? Perhaps, but like the V100 improvements that would be a great outcome!
I wouldn't be surprised if public TPUs are to some degree a way to print money: at least for a while, Google can probably just rent out its unused capacity that it had already planned and paid for. :-)
30% cheaper e2e price for the company's first public offering, compared to the market leader's top-of-the-line chip sounds...pretty good to me?
It used to, until Volta came out with what are basically TPU-like tensor units embedded on the chip. We'll see if AMD joins them, as Vega in theory should be around Volta's level as well; the tooling just isn't there yet.
Yeah, I'm a bit disappointed myself. When announced initially, it seemed Google had a huge lead. But they dragged their feet for two years getting it to market, and now NVidia is nipping at their heels already.
I suspect they are using the TPUs internally for competitive advantage, and these are the leftovers they are done with. They're probably using v4 or v5 internally already.
tl;dr TPU helps Google Clouds' network effect.
If they do not continue to improve on process, they will fall behind in just a few years.