For all others with burst workloads, training in the cloud can make sense, but that has been the case for a while already.
... uh, you sure about that? Let me go check on the 3 models I have concurrently training for my organization on 3 separate GPU servers (all 2-year-old hardware, to boot) that have been running continuously for the past 36 hours. It pretty much works out to 24/7 training for the past several months.
And BTW, this is massively cheaper for us than training in the cloud.
Pretraining BERT takes 44 minutes on 1024 V100 GPUs [1]
This requires dedicated instances, since shared instances won't reach peak performance, if only because of the "noisy neighbour" effect.
At GCP, a V100 costs $2.48/h [2]; rounding the 44 minutes up to one billed hour per GPU, Microsoft's experiment would've cost 1024 * $2.48 = $2,539.52.
Smaller providers offer the same GPU at just $1.375/h [3], so a reasonable lower limit would be around $1,408.
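As a quick sanity check on those cloud figures (a sketch; the per-hour rates are hard-coded from [2] and [3], and I'm assuming each of the 1024 GPUs is billed for a full hour, i.e. 44 min rounded up):

```python
# Back-of-the-envelope cloud cost for the 44-minute, 1024-GPU BERT run [1].
gpus = 1024
billed_hours_per_gpu = 1.0  # 44 min rounded up to one billed hour

gcp_rate = 2.48      # $/GPU-hour at GCP [2]
budget_rate = 1.375  # $/GPU-hour at a smaller provider [3]

gcp_cost = gpus * billed_hours_per_gpu * gcp_rate
budget_cost = gpus * billed_hours_per_gpu * budget_rate
print(gcp_cost, budget_cost)  # 2539.52 1408.0
```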
For a one-off BERT pretraining, provided highly optimised workflows and distributed training scripts are already at hand, renting GPUs seems to be the way to go.
The cost of V100-equivalent end-user hardware (we don't need to run in a datacentre, dedicated workstations will do) is about $6,000 (e.g. a Quadro RTX 6000), provided you don't need double precision. The card has equal FP32 performance, a lower TGP, and 24 GB of VRAM, which sits between the 16 GB and 32 GB versions of the V100.
Workstation hardware to go with such a card will cost about $2,000, so $8,000 is a reasonable cost estimate. The cost of electricity varies between regions, but in the EU the average non-household price is about 0.13€/kWh [4].
Pretraining BERT therefore costs an estimated 1024 h * 0.13€/kWh * 0.5 kW ≈ 67€ in electricity (power consumption estimated from the TGP plus the typical draw of an Intel Xeon workstation, from my own measurements when training models).
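The electricity estimate, spelled out (a sketch; the 0.5 kW full-system draw is my own measurement-based assumption, and the 1024 GPU-hours again round the 44-minute run up to one hour per GPU):

```python
# Electricity cost for one pretraining's worth of compute (1024 GPU-hours)
# run sequentially on a single local workstation.
gpu_hours = 1024          # 1024 GPUs x 44 min, rounded up to 1 h each
price_eur_per_kwh = 0.13  # EU average non-household price [4]
draw_kw = 0.5             # assumed full-system draw while training

cost_eur = gpu_hours * price_eur_per_kwh * draw_kw
print(round(cost_eur, 2))  # 66.56
```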
In order to get the break-even point we can use the following equation: t * $1,408 = $8,000 + t * $69 (the ~67€ of electricity per run converted to dollars), which gives t = 8,000/(1,408 - 69) ≈ 5.97, so t > 5.
In short, if you pretrain BERT 6 times, you save money by BUYING a workstation and running it locally over renting cloud GPUs from a reasonably cheap provider.
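The break-even calculation as a sketch, using the numbers above (the $69 per-run electricity figure assumes a rough euro-to-dollar conversion of the ~67€ estimate):

```python
import math

# Break-even between renting cloud GPUs and buying a workstation.
cloud_per_run = 1408        # $ per pretraining at the cheaper provider [3]
workstation = 8000          # $ one-time hardware cost
electricity_per_run = 69    # $ per pretraining, ~67 euros converted (assumption)

# Solve t * 1408 = 8000 + t * 69  ->  t = 8000 / (1408 - 69)
break_even = workstation / (cloud_per_run - electricity_per_run)
runs_to_save = math.ceil(break_even)
print(round(break_even, 2), runs_to_save)  # 5.97 6
```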
This example only concerns BERT, but you can use the same reasoning for any model that you know the required compute time and VRAM requirements of.
This only concerns training, too - inference is a different can of worms entirely.
[1] https://www.deepspeed.ai/news/2020/05/27/fastest-bert-traini...
[2] https://cloud.google.com/compute/gpus-pricing
[3] https://www.exoscale.com/syslog/new-tesla-v100-gpu-offering/
[4] https://ec.europa.eu/eurostat/statistics-explained/index.php...