While not perfect, I want to commend the RiseML folks for doing not only a “just out of the box” run in both regular and fp16 mode (for the V100), but also adding their own LSTM experiment to the mix. We need third-party benchmarks whenever new hardware or software is being sold by vendors (reminder: I benefit from you buying Google Cloud!).
I hope the authors are able to collect some of the feedback here and update their benchmark and blog post. The question about batch size comparisons is probably the most direct, but like others, I’d encourage a run on 1, 2, 4 and 8 V100s as well.
Thanks for your feedback and your suggestions (and everybody else's)! We'll make sure to gather all of the valuable feedback and run additional experiments. Different batch sizes and a comparison against more than one GPU are already planned (and partly executed).
It makes any benchmarks become Google-cloud benchmarks, right?
Edit: I am complaining a bit about the lack of availability, but there's also a real point here. If there's no source for TPUs outside of Google, Google Cloud competes only with other cloud providers and with owning physical GPUs. Long term, it has no incentive to be anything more than a little bit more efficient than these, however much its cost of producing TPUs declines.
(I'm saying this with my CMU hat, not my Google hat.)
For a while, it's very likely that Google will be the main user of these, so there's still plenty of incentive for it to increase efficiency and reduce costs.
I assume there's a high capital cost for this new hardware, but when they scale it up, I wonder if the ratio of TPU cost to GPU cost will trend towards the ratio of performance-per-watt between the platforms? Seems like a natural limit, even if it never quite gets there.
[0] https://cloud.google.com/blog/big-data/2017/05/an-in-depth-l...
But you aren't paying for the electricity, you're paying for processing, which is an unconnected parameter. They only "sell" these chips per use, not on the open market
Presuming power is a major cost input (which I assume), their profit/op is much higher. So they could sell for less than an equivalent GPU and make more money. But they think they can get away with value pricing it (processing more per unit time is presumably worth it to many customers), and more power to them.
(that last bit was not an intentional pun; only noticed after typing it)
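As a toy illustration of that margin argument (all prices and wattages below are made up purely for illustration, not actual Google or NVIDIA figures):

```python
# Hypothetical numbers only -- the point is the shape of the argument,
# not the specific values.
def profit_per_hour(price_per_hour, watts, elec_cost_per_kwh):
    """Profit per device-hour if electricity were the only marginal cost."""
    power_cost = watts / 1000.0 * elec_cost_per_kwh  # kW * $/kWh
    return price_per_hour - power_cost

# Assumption: the TPU draws less power for the same rental price.
gpu = profit_per_hour(price_per_hour=2.50, watts=300, elec_cost_per_kwh=0.08)
tpu = profit_per_hour(price_per_hour=2.50, watts=200, elec_cost_per_kwh=0.08)
assert tpu > gpu  # lower power draw -> higher margin at the same price
```

Under those assumptions, the lower-power part earns more per hour at identical pricing, which is the room to undercut an equivalent GPU and still come out ahead.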
Depends on your metric, Jonathan! If you focus on the per dollar numbers, then it’s actually net favorable to the V100, because a second GPU over NVLINK won’t be as cost-efficient. If what you care about is raw throughput “in a single box”, then 8xV100 probably comes out ahead here.
Like someone else below though, I worry about the “hey wait a minute, changing the batch size just for the TPU seems unfair” and the whole “the LSTM didn’t converge” bit. Not a bad first draft, but hopefully the authors can do some more comparisons.
Thanks for your feedback. As I noted above, we will report further results with larger batch sizes (and smaller ones for the TPU). The LSTM not converging is one of the experiences we wanted to share. We are working on solving this issue and will update the post accordingly. Our goal is really a fair and valuable comparison, which is not easy, so we value all of the feedback.
(I'm obviously biased - I helped with parts of the cloud-side of cloud TPU - but I presume this comment stands on its own. :-)
Also, while I certainly agree that the performance/dollar comparison is highly relevant to customers at a given instant, that may only tell us that Google is subsidizing this hardware now that they've deployed it, and/or that, lacking serious competition, NVIDIA has been building crazy margins into their P100/V100 prices. In understanding fundamental technological tradeoffs, and even the limits of what the pricing in a more competitive market could be, it is relevant to compare performance per unit of hardware resources (mm^2, die/package, watt, GB of HBM, etc.)
In short, these comparisons are hard, and no single one tells a complete story. I pushed back because, while the post includes some nuance, it brushes a great deal under the rug and focuses primarily on a problematic comparison.
(Further disclosure: I'm at least the third person in this sub-thread with some Google Brain/Cloud affiliation. I am speaking in my independent academic voice. I also think TPUs are great, having them publicly available now is great, and competition and diversity of architectural approach in accelerators is great. I appreciate the effort of the authors, but think the subtlety of these comparisons requires serious care.)
It's all about performance per dollar.
Not necessarily. The DGX-1, for example, has pretty poor perf/$$ but reduces the time a data scientist spends waiting. For some organizations, their people time is so valuable that what matters is “what gets me my answers back faster”, because that employee is easily $100/hr+.
That’s actually why the 8xV100 with NVLINK is so attractive (and why the TPUs also have board to board networking, not just chip to chip).
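A back-of-the-envelope version of that tradeoff, with made-up numbers (the $100/hr figure is from the comment above; the hardware rates are illustrative):

```python
# Toy calculation: when does pricier-but-faster hardware win overall?
def total_cost(hw_cost_per_hour, hours_to_result, engineer_cost_per_hour=100.0):
    # Count the engineer's time spent waiting as a cost alongside the hardware.
    return (hw_cost_per_hour + engineer_cost_per_hour) * hours_to_result

cheap = total_cost(hw_cost_per_hour=3.0, hours_to_result=10)   # slower box
fast = total_cost(hw_cost_per_hour=25.0, hours_to_result=2)    # 8xV100-class box
assert fast < cheap  # worse perf/$ on paper, cheaper once people time is included
```

The faster box loses badly on raw perf/$ but wins once the person waiting on it is priced in.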
I'd like to know perf/watt, for instance, even if it doesn't matter to the customer.
Point well taken, we'll make sure to add a comparison to 4 and 8 GPUs. For now, a "Cloud TPU" (containing 8 cores) seems to be the smallest unit to allocate. The question of what exactly makes up a single device and how many to compare against each other is not easy to answer.
Impressive results regardless though; quite a bit faster relative to the V100 than the paper specs would suggest.
Good point, I agree that the FP16 GPU results should be closer or grouped with the TPU results. We'll try to update accordingly.
Note that the TPU supports larger batch sizes because it has more RAM. We tested multiple batch sizes for GPUs and reported the fastest one. We'll try increasing the batch sizes as far as possible and report. The overall comparison will likely not change by much - we saw speed increases of around 5% doubling the batch size from 64 to 128. (https://www.tensorflow.org/performance/benchmarks also reports numbers for batch sizes of 32 and 64 on the P100)
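For what it's worth, the "test several batch sizes and report the fastest" approach can be sketched like this (a NumPy matmul stands in for a real training step; all names and sizes are illustrative, not the actual benchmark code):

```python
import time
import numpy as np

def images_per_sec(batch_size, dim=512, steps=20):
    """Throughput of a toy 'training step' (one matmul) at a given batch size."""
    x = np.random.rand(batch_size, dim).astype(np.float32)
    w = np.random.rand(dim, dim).astype(np.float32)
    start = time.perf_counter()
    for _ in range(steps):
        _ = x @ w  # stand-in for one forward/backward pass
    elapsed = time.perf_counter() - start
    return batch_size * steps / elapsed

# Sweep candidate batch sizes and report the fastest for this device.
results = {bs: images_per_sec(bs) for bs in (32, 64, 128)}
fastest = max(results, key=results.get)
```

The per-device sweep is what makes "optimal batch size per part" a defensible methodology, as discussed above.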
Oh! You should definitely say that. It's semi-reasonable then to choose the batch size that is optimal for the part. It'd be good to make sure this isn't why your LSTM didn't converge though...
They are claiming a 29x improvement in that area.
Link to NVIDIA documentation on mixed-precision TensorCores: https://devblogs.nvidia.com/inside-volta/
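For anyone curious, the core mixed-precision idea behind TensorCores (fp16 multiplies, fp32 accumulation) can be illustrated in a few lines of NumPy; this is a conceptual sketch, not how the hardware is programmed:

```python
import numpy as np

np.random.seed(0)
# Inputs quantized to fp16, as they would be for a TensorCore matmul.
a = np.random.rand(64, 64).astype(np.float16)
b = np.random.rand(64, 64).astype(np.float16)

# fp16 inputs, fp32 accumulation -- what the hardware does internally.
mixed = np.matmul(a.astype(np.float32), b.astype(np.float32))

# Keeping everything in fp16: accumulation error grows with the inner dimension.
pure_fp16 = np.matmul(a, b).astype(np.float32)

# High-precision reference over the same fp16-quantized inputs.
reference = np.matmul(a.astype(np.float64), b.astype(np.float64))
assert np.abs(mixed - reference).max() < np.abs(pure_fp16 - reference).max()
```

The fp32 accumulator is why mixed precision can roughly double throughput without giving up the accuracy of a pure-fp16 pipeline.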
TPUv2 is specially optimized for deep learning.
Nvidia's Volta microarchitecture is a graphics processor with additional tensor units. It's a general-purpose (GPGPU) chip designed with graphics and other scientific computing tasks in mind. Nvidia has enjoyed monopoly power in the market, and a single microarchitecture has been enough in every high-performance category.
The next logical step for Nvidia is to develop a specialized deep learning chip to compete with TPUv2 and others.
I don't know, this benchmark seems to show V100 doing pretty well against a specialized ASIC. It may well be that all NVIDIA has to do is cut costs on V100 to make a two V100s about as expensive as the cloud TPUv2. With increased batch size, it looks like two V100s would have performance comparable to TPUv2.
The microarchitecture has many unnecessary things, and it's not optimized as a whole for deep learning.
I'm also not sure how we can take Google's word for the numbers, since they might as well be eating a less-than-ideal power cost to promote their platform. Any upfront cost will probably be offset by locked-in customers later on.
I might just be a bit cynical though.
If that is right, is the "TensorFlow-optimized" ResNet-50 using 16-bit floats when running on TPUv2?
No, this is really only comparing the throughput on the devices. A thorough comparison should really focus on time to reach a certain quality - including all of the tricks available for a certain architecture.
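A minimal sketch of what a time-to-quality harness might look like (the "training" here is a fake stand-in whose accuracy climbs 1% per step; function and variable names are hypothetical):

```python
import time

def time_to_quality(train_step, evaluate, target, max_steps=10000):
    """Seconds and steps until evaluate() first reaches target (None if never)."""
    start = time.perf_counter()
    for step in range(1, max_steps + 1):
        train_step()
        if evaluate() >= target:
            return time.perf_counter() - start, step
    return None, max_steps

# Fake model state: accuracy rises by 0.01 per training step.
state = {"n": 0}
elapsed, steps = time_to_quality(
    train_step=lambda: state.update(n=state["n"] + 1),
    evaluate=lambda: state["n"] / 100,
    target=0.75,
)
assert steps == 75
```

Measuring wall-clock time to a target metric, rather than raw steps/sec, is what lets per-architecture tricks (batch size, precision, learning-rate schedules) be compared fairly.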
These are only available on the Google Cloud right now. I don't think there are plans to sell them anytime soon.
Does this mean it's inference-only? (I only quickly scanned the article)
I wonder which is higher, the cost for creating the TPUs in terms of engineering and manufacturing or the cost differential in terms of usage as compared to NVIDIA's latest?
I worry about Google long term here. I am surprised the TPU doesn't kick the ass of the NVIDIA chips.
By the logic above, you would conclude that TPUv1 (the inference-only chip) might have been a mistake, but we’ve been very public about how it “saved us from building lots of datacenters”.
That wasn’t ever sold as part of Cloud, so the benefit there is all from the second bit you mentioned: cheaper and more efficient than GPUs at the time. The paper also goes into more detail, but the size of that initial engineering team and time to market were both quite small.
For training, before Volta (and kind of Pascal), GPUs were the best option but not particularly efficient. Volta does the same “we should have a single instruction that does lots of math in one shot” by cleverly reusing the existing functional units. That the V100 is a great chip is a good outcome for the whole industry. But GPUs aren’t (and shouldn’t be) just focused on ML. My bet is that there’s still a decent amount of runway left in specialized chips for ML, just as GPUs carved out their own niche versus CPUs.
But again, the “even just for Google” benefit is really enormous so I wouldn’t assume that Cloud has to pay for the entire effort. Could GPU manufacturers improve the cost:performance ratio of ML workloads enough that Google doesn’t have to build TPUs anymore? Perhaps, but like the V100 improvements that would be a great outcome!
I wouldn't be surprised if public TPUs are to some degree a way to print money: at least for a while, Google can probably just rent out its unused capacity that it had already planned and paid for. :-)
30% cheaper e2e price for the company's first public offering, compared to the market leader's top-of-the-line chip sounds...pretty good to me?
It used to, until Volta came out with what are basically TPU-like tensor units embedded on the chip. We'll see if AMD joins them, as Vega in theory should be around Volta's level as well; the tooling just isn't there yet.
Yeah, I'm a bit disappointed myself. When announced initially, it seemed Google had a huge lead. But they dragged their feet for two years getting it to market, and now NVidia is nipping at their heels already.
I suspect they are using the TPUs internally for competitive advantage, and these are the leftovers they are done with. They're probably using v4 or v5 internally already.
tl;dr TPU helps Google Clouds' network effect.
If they do not continue to improve on process, they will fall behind in just a few years.