Would assume Google is able to do that because of the lower power required.
I am actually more curious to get a paper on the new speech NN Google is using. Supposedly it pushes 16k samples a second through a NN; it's hard to imagine how they did that and were able to roll it out, as you would think the cost would be prohibitive.
You are ultimately competing with a much less compute heavy solution.
https://cloudplatform.googleblog.com/2018/03/introducing-Clo...
Suspect this was only possible because of the TPUs.
Can't think of anything else where controlling the entire stack including the silicon would be more important than AI applications.
You can’t buy a TPU; it’s a cloud-only thing. They also show it’s not a huge difference in either perf or time to converge (albeit only for one architecture).
I would say kudos to V100 and this benchmark that breaks the TPU hype.
Or maybe reading it wrong?
That is close to half as much to use Google, is it not?
BTW, the TPUs are also about twice as fast.
Sounds like Google is pretty far ahead of Nvidia. Which really just makes sense, as Google does the entire stack and is just going to have the data to optimize the silicon.
About half the cost is hype?
I want it in the cloud and don't want to deal with updating, etc. Would think most are the same for anything of any scale. Could not imagine building up rigs any longer and dealing with all the issues. Plus it's much harder to scale.
TPUv2 has 1.27x-1.86x the images/s/$.
And the other chart titled: Cost to reach 75.7% top-1 accuracy.
Where TPUv2 costs 62.5% of the reserved GPU instance cost and 42.6% of the unreserved GPU cost.
Key takeaway from the article:
> While the V100s perform similarly fast, the higher price and slower convergence of the implementation results in a considerably higher cost-to-solution.
No matter how fast a system is on the inside, you have to get data in and out of it -- at the very least to memory. SRAM takes too much area, and there is a limit to DRAM bandwidth despite technologies such as eDRAM and HBM. Some tasks are compute intensive, but for general tasks, a processor that is 100x faster would need 100x faster memory to really be 100x faster.
Thus advances in real-life performance are likely to be more like a factor of 2.
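A back-of-the-envelope sketch of that bandwidth-bound argument (an Amdahl-style model; all numbers below are illustrative assumptions, not measured figures):

```python
# Rough roofline-style estimate: effective speedup is capped by
# whichever of compute and memory bandwidth runs out first.

def effective_speedup(compute_speedup, bandwidth_speedup,
                      compute_bound_fraction):
    """The compute-bound fraction of runtime scales with the compute
    speedup; the remainder scales with memory bandwidth."""
    t_compute = compute_bound_fraction / compute_speedup
    t_memory = (1 - compute_bound_fraction) / bandwidth_speedup
    return 1 / (t_compute + t_memory)

# A chip 100x faster at math, but with only 2x the memory bandwidth,
# on a workload that is half memory-bound:
print(round(effective_speedup(100, 2, 0.5), 1))  # ~3.9x, not 100x
```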
For training I never pay full price in the AWS cloud; rather, I run interruptible instances and pay a fraction of the list price. People I know who train in the Google cloud seem to get interrupted all the time even though they are paying full price.
Inference is another story. Once you have the trained model, you will usually need to run inference many many more times than you run training and this gets more so the bigger scale you are running at. That hits your unit costs and it is where you need to pinch every penny.
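A toy cost model makes the point about unit costs (all numbers here are hypothetical, not from the article):

```python
# Toy cost model: a one-off training cost plus a per-inference unit cost.
training_cost = 10_000          # one-off, in $ (hypothetical)
cost_per_1k_inferences = 0.05   # unit cost, in $ (hypothetical)

def total_cost(inferences):
    return training_cost + inferences / 1000 * cost_per_1k_inferences

# At ten billion inferences the per-inference term dominates, so that
# unit cost is where every penny counts:
print(total_cost(10_000_000_000))  # roughly $510,000, ~98% of it inference
```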
Depends on how much you plan to use the hardware. If it's running near continuously, total cost of ownership is very important. Power costs can quickly dominate TCO.
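A rough illustration of how power eats into TCO for continuously running hardware (wattage, prices, and lifetime are assumptions, not measured figures):

```python
# Hypothetical numbers: a ~$3,000 GPU drawing ~300 W under load,
# run 24/7 for 3 years at $0.12/kWh.
gpu_price_usd = 3000
power_watts = 300
hours = 3 * 365 * 24
price_per_kwh = 0.12

energy_cost = power_watts / 1000 * hours * price_per_kwh
print(round(energy_cost))  # electricity alone, before cooling overhead
print(round(energy_cost / gpu_price_usd * 100))  # as % of hardware price

# Datacenter cooling/overhead (PUE) can add 50% or more on top of this.
```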
> While the V100s perform similarly fast, the higher price and slower convergence of the implementation results in a considerably higher cost-to-solution.
You can definitely do this on a GPU. We use the older auto-regressive WaveNets (not Parallel Wavenet) for inference on GPUs, with the newly released nv-wavenet code. Here's a link to a blog post about it:
https://devblogs.nvidia.com/nv-wavenet-gpu-speech-synthesis
That code will generate audio samples at 48kHz, or if you're worried about throughput, it'll do a batch of 320 parallel utterances at 16kHz.
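For a sense of scale, the aggregate sample throughput in batch mode, computed from the figures above:

```python
# Single-stream mode: one utterance at 48 kHz.
single_stream = 48_000  # samples/s

# Batch mode: 320 parallel utterances at 16 kHz each.
batch_throughput = 320 * 16_000
print(batch_throughput)                  # aggregate samples/s
print(batch_throughput / single_stream)  # ~107x the single-stream rate
```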
I would expect a dedicated accelerator to need at least a 5-10X advantage to outweigh all the other infrastructure and ecosystem costs.
GPUs are more useful for a wide variety of data-parallel tasks, and many more NN frameworks work on top of CUDA than work on the TPU.
In terms of horizontal scalability, Nvidia has been rapidly iterating on increasing both memory and interlink bandwidth (including NVSwitch [1]), while each 'TPU' is actually 4 interconnected chips, so it likely has less upward scalability.
Also note that the tensor cores on a V100 take roughly 25-30% of the actual area. If Nvidia wanted to, they could probably easily make a pure tensor chip that beat the TPU in performance, could be produced in volume on their existing process, and also had full compatibility with their entire stack.
All in all, a 2x price/performance advantage for a hyper-specialized accelerator is basically a loss, just like how nobody installs a Soundblaster card anymore, or how consumer desktops don't run discrete GPUs even though integrated graphics are a few times slower.
[1] https://www.nextplatform.com/2018/04/04/inside-nvidias-nvswi...
Happy to answer questions!
EDIT: you don't get sustained use discounts, either, at the moment. You can get either for GCP GPUs, though. Perhaps that will change once TPUs are out of beta?
Any idea of how much variation in accuracy you get on different training runs of the same model on the same hardware? My understanding is that model quality can and does vary from one run to the next on these kinds of large datasets - from a single observation, it's hard to know if the difference is real or noise.
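One way to put numbers on that question: repeat the run a handful of times and look at the spread (the accuracy values below are invented placeholders, not real measurements):

```python
import statistics

# Hypothetical top-1 accuracies from five training runs of the
# same model on the same hardware.
accuracies = [75.7, 75.9, 75.5, 76.0, 75.6]

mean = statistics.mean(accuracies)
stdev = statistics.stdev(accuracies)
print(round(mean, 2), round(stdev, 2))

# A single-run difference smaller than roughly two standard deviations
# is hard to distinguish from run-to-run noise.
```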
But what's going on when some of the implementations of a standard algorithm don't converge, and different hardware has different accuracy rates on the same algorithm? Are DNNs really that flaky? And does it really make sense to be doing performance comparisons when the accuracy performance doesn't match?
Is the root problem that ResNet-50 works best with a smaller batch size?
And how do you do meaningful research into new DNNs if there's always a "Maybe if I ran it again over there I'd get better results" factor?
Thank you.
That is not all that close, is it?
Note also, that the ~2% performance difference is only on one model (ResNet-50) and cannot be generalized to all workloads/all of deep learning (at least not without further proof).
The TPU implementation applies very compute-intensive image pre-processing steps and actually sacrifices raw throughput.
Thanks
In terms of how much compute power the TPU pre-processing needs I only have very rough numbers: I ran the same pre-processing while training ResNet-50 on a node with 4 GPUs and it was consistently utilizing >22 CPU cores (including all of the other CPU-tasks while training).
It's great that there is now a wider choice of (pre-trained?) models formulated for mixed-precision training.
When I was comparing Titan V (~V100) and 1080ti 5 months ago, I was only able to get 90% increase in forward-pass speed for Titan V (same batch-size), even with mixed-precision. And that was for an attention-heavy model, where I expected Titan V to show its best. Admittedly, I was able to use almost double the batch-size on Titan V, when doing mixed-precision. And Titan V draws half the power of 1080ti too :)
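Putting those two observations together, the perf-per-watt story is better than the raw speedup suggests (a quick sketch using only the numbers above):

```python
speedup = 1.9         # ~90% faster forward pass on Titan V vs 1080ti
relative_power = 0.5  # Titan V draws roughly half the power

perf_per_watt = speedup / relative_power
print(perf_per_watt)  # 3.8x the performance per watt
```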
In the end my conclusion was: I am not a researcher, I am a practitioner. I want to do transfer learning or just use existing pre-trained models, without tweaking them. For that, tensor cores give no benefit.
Yes, thanks for mentioning that! That's what the article is alluding to at the end. There's also something like a "cost-to-model" and that's influenced by how easy it is to make efficient use of the performance and how much tweaking it needs. It's also influenced by the framework you use... However, that's difficult to compare and almost impossible to measure.
After 59 days of playing with it, I sent it back (initiated return on 30th day, after I already figured out it doesn't live up to the hype, then had another 30 days to actually send it back).
With $3,000 I can buy 4 1080ti's, while only two are necessary to beat Titan V (in Titan V's best game). I only bought one though. NowInStock.net helped with buying 1080ti directly from Nvidia.
AMD will enter the game soon once they get their software working, Intel will follow.
I suspect that Nvidia will respond with its own specialized machine learning and inference chips to match the cost/performance ratio. As long as Nvidia can maintain high manufacturing volumes and small performance edge, they can still make good profits.
But the TPUs are half the cost per this article?
Plus Google does the entire stack and can better optimize the hardware versus Nvidia. So it seems Google can improve faster, I would think.
If there ever was a huge advantage to doing the entire stack, it is with neural networks.
A perfect example is Google's new speech doing 16k samples a second with a NN.
https://cloudplatform.googleblog.com/2018/03/introducing-Clo...
Do not think Google could offer this service at a competitive cost without the TPUs.
This new method is replacing one that was far less compute intensive, so offering it at a competitive price requires lowering compute cost, which I suspect is only possible with the TPUs.
Exactly. Nvidia can already match the performance without a 100% specialized processor. It's just the price they need to cut, by optimizing their architecture for tensor processing and reducing their profits when competition emerges.
Google is not in the business of becoming a major chip maker or competing with Nvidia head on. Putting hundreds of millions into a new microarchitecture every second year eats lots of resources. They just want a competitive market and the prices to go down.
Can't you just buy some 1080s for cheaper than this? I understand there are electricity and hosting costs, but cloud computing seems expensive compared to buying equipment.
So the restriction applies to the 1080ti but _not_ the titan V. I completely agree the restriction is total bullshit but it's important to get the facts straight.
Sidenote: Love the illustrations that accompany most of your blog posts, are they drawn by an in-house artist/designer?
The illustrations are from an artist/designer we contract from time to time. I agree, his work is awesome!
Kudos to them; they are awesome!
That said, they fixed this on NVSwitch so it's just another HW hiccup like int8 was on Pascal.
You have price per hour and performance in images per second, so you need to scale one to the other before taking the ratio. Also, the resulting metric is not "images per second per $" but just "images per $".
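A minimal sketch of that unit conversion (the throughputs and hourly prices are placeholder values, not the article's):

```python
# Convert an hourly price and an images/s throughput into images/$.
def images_per_dollar(images_per_sec, price_per_hour):
    images_per_hour = images_per_sec * 3600
    return images_per_hour / price_per_hour

# Placeholder numbers for two hypothetical instances:
print(images_per_dollar(2000, 6.50))   # instance A
print(images_per_dollar(1800, 24.48))  # instance B
```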
EDIT: I found [1] which describes "tensor cores", "vector/matrix units" and HBM interfaces. The design sounds similar in concept to GPUs. Maybe they don't have or need interpolation hw or other GPU features?
Suspect we will need a gen 3 to get a paper on the gen 2.
Here is the gen 1 paper, highly recommended. Pretty interesting: it uses 65,536 very simple multiply-accumulate units.
https://supercomputersfordl2017.github.io/Presentations/Imag... http://learningsys.org/nips17/assets/slides/dean-nips17.pdf
For the last version of the TPU, Google provided more detail, e.g., in this paper:
https://arxiv.org/pdf/1704.04760.pdf
Hopefully, Google will publish something similar for TPUv2, but I have no knowledge whether or when that might happen.
Definitely, no need to do any kind of rasterization here.
I wonder whether NVLink would make any difference for Resnet-50. Does anyone know whether these implementations require any inter-GPU communication?
Because Intel was involved in its development and made a number of tweaks to improve performance.
I'd be curious whether it actually was significant or not.
A bit odd that the TPUs are provisioned on such a weak machine compared to the V100s, especially when there were comparisons that included augmentation and other processing outside of the TPU.