However, I think the overall trend this article talks about is accurate. There has been an increased focus on cost-to-train, and you can see that with models like EfficientNet, where NAS is used to optimize accuracy and model size jointly.
We also seem to be moving more towards a world where big problem-specific models are shared (BERT, GPT), so that the base time to train doesn't matter much unless you're doing model architecture research. For most end-use cases in language and perception, you'll end up picking up a 99%-trained model and fine-tuning it on your particular version of the problem.
Training has become much more accessible, due to a variety of things (ASICs, offerings from public clouds, innovations on the data science side). Comparing it to Moore's Law doesn't make any sense to me, though.
Moore's Law is an observation on the pace of increase of a tightly scoped thing, the number of transistors.
The cost of training a model is not a single "thing," it's a cumulative effect of many things, including things as fluid as cloud pricing.
Completely possible that I'm missing something obvious, though.
I assume it's meant as a qualitative comparison rather than a meaningful quantitative one. Sort of a (sub-)cultural touchstone to illustrate a point about which phase of development we're in.
With CPUs, during the phase of consistent year after year exponential growth, there were ripple effects on software. For example, for a while it was cost-prohibitive to run HTTPS for everything, then CPUs got faster and it wasn't anymore. So during that phase, you expected all kinds of things to keep changing.
If deep learning is in a similar phase, then whatever the numbers are, we can expect other things to keep changing as a result.
The enabling tech was the AES-NI instruction set, not raw CPU speed.
Agreed on the rest. The main reason modern CPUs and GPUs all support 16-bit floats is probably the deep learning trend.
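To make the trade-off concrete, here's a small numpy illustration of what 16-bit floats give up. numpy's float16 is IEEE half precision, the same format the hardware implements.

```python
import numpy as np

# FP16 halves memory and bandwidth, which is why DL hardware embraced it,
# but it gives up both range and precision relative to FP32.
fp16 = np.finfo(np.float16)
print(fp16.max)  # 65504.0 -- large activations or gradients can overflow
print(fp16.eps)  # ~0.000977 -- the spacing between 1.0 and the next float

# Small updates can vanish entirely in FP16:
w = np.float16(1.0)
print(w + np.float16(1e-4) == w)  # True: the update rounds away
```

The vanishing-update problem is one reason training typically uses "mixed" precision (FP16 math with FP32 master weights) rather than pure FP16.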
What does “more transistors” mean? It means just what Gordon Moore meant when he said it: the opportunity for more function in the same space, at the same cost.
Laypeople and marketing grabbed the term and claimed it implied “faster,” which was then absurdly conflated with CPU clock speed (itself an important input, though hardly the only one, in determining the actual speed of a system).
The use here is of the “garbled analogy” sort, which is surely the dominant use today.
The biggest innovation I've seen is in the cloud: backplane I/O and memory are essential, and until a few years ago there weren't many cloud configurations suitable for massive amounts of I/O.
The number of transistors also doesn't depend on a single thing. It can be argued that many macro events have contributed since the '80s: the VC model for chipmakers in Silicon Valley, the rise of the internet, going fabless, the rise of mobile, and innovations in fabrication technology.
For instance: quality control, abnormality detection (e.g. in medicine), agriculture (lots of movement there right now), parts inspection, assembly inspection, sorting, and so on. There are more applications for this stuff than you might think at first glance. Essentially, if a toddler can do it and it's a job right now, that's a good target.
None of these is anything someone can run from their bedroom, because they have very high quality and regulatory requirements and require constant work outside of the actual AI training.
This is actually reflected in the margins of "AI" companies, which are significantly lower than those of traditional SaaS businesses and require significantly more manpower to deal with the long-tailed problems: the cases where the AI fails, which are what actually matter.
Do it for a couple of publicly available docs, then contact the org saying you offer 'archive digitization' so their data people can mine it for intelligence.
Most of the time and resources of 'Digital Transformation'/Data Science departments go to just manually extracting info from all kinds of old docs, PDFs, and spreadsheets containing institutional knowledge.
The opportunity exists for a decentralized network that allows models to be trained on datasets held at individual facilities.
Think of all the data sitting in silos from clinical trials. There is, of course, the painful process of authenticating researchers for access to data like that, but it can be done. There just needs to be an economic reason to make that kind of effort.
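One common shape for this kind of decentralized training is federated averaging: each facility trains locally on its private data, and only model weights cross the boundary. A minimal sketch with simulated silos (toy linear-regression data, honest participants assumed, no authentication or encryption):

```python
import numpy as np

rng = np.random.default_rng(1)

# Three "facilities" (e.g. trial sites), each holding private data that
# never leaves the silo; only model weights are exchanged.
true_w = np.array([2.0, -1.0, 0.5])
silos = []
for _ in range(3):
    X = rng.normal(size=(100, 3))
    y = X @ true_w + 0.01 * rng.normal(size=100)
    silos.append((X, y))

w_global = np.zeros(3)
for _ in range(20):                          # communication rounds
    local_weights = []
    for X, y in silos:
        w = w_global.copy()
        for _ in range(10):                  # local SGD on private data
            grad = X.T @ (X @ w - y) / len(y)
            w -= 0.1 * grad
        local_weights.append(w)
    # Server averages the locally trained weights -- the data never moves.
    w_global = np.mean(local_weights, axis=0)

print(np.round(w_global, 2))  # close to [2, -1, 0.5]
```

A real deployment would add the hard parts the comment alludes to: authenticating participants, and probably secure aggregation so the server never sees any single site's update.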
I got pulled in the direction of using ML to predict costs of care in insurance, so I didn't go further down the rabbit hole, but I did author a patent for a novel approach to exchanging data via decentralized identities.
If any of this sounds exciting to you feel free to email me. hn (at) strapr (dot) com
They pride themselves on this type of fundamental, bottom up analysis on the market.
It's fine. I don't know if I agree with comparing Moore's law, which is fundamentally about hardware, with the cost to run a "system" that is a combination of customized hardware and new software techniques.
Hackernews discussion for the article: https://news.ycombinator.com/item?id=18063893
It really is interesting how this is changing the dynamics of neural network training. Now it is affordable to train a useful network on the cloud, whereas 2 years ago that was reserved for companies with either bigger investments or an already consolidated product.
I honestly don't see how anything changed significantly in the past 2 years. Benchmarks indicate that a V100 is barely 2x the performance of an RTX 2080 Ti [1], and a V100 is:
• $2.50/h at Google [2]
• $13.46/h (4xV100) at Microsoft Azure [3]
• $12.24/h (4xV100) at AWS [4]
• ~$2.80/h (2xV100, 1 month) at LeaderGPU [5]
• ~$3.38/h (4xV100, 1 month) at Exoscale [6]
Other, smaller cloud providers are in a similar price range to [5] and [6] (read: GCE, Azure, and AWS are way overpriced...).
Using the 2x figure from [1] and adjusting the build price to a 2080 Ti and an AMD Ryzen 9 3950X instead of the Threadripper results in figures similar to those in the article you provided.
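Working the per-GPU arithmetic out from the list above (prices as quoted in the comment; the ~2x V100-vs-2080 Ti performance figure is from [1]):

```python
# Hourly price and GPU count per instance, from the figures above.
prices = {
    "Google GCE": (2.50, 1),
    "Azure":      (13.46, 4),
    "AWS":        (12.24, 4),
    "LeaderGPU":  (2.80, 2),
    "Exoscale":   (3.38, 4),
}

for provider, (hourly, n_gpus) in prices.items():
    per_gpu = hourly / n_gpus
    # With the ~2x figure from [1], one V100 hour buys roughly two
    # 2080 Ti hours of training throughput.
    equiv_2080ti = per_gpu / 2
    print(f"{provider:10s} ${per_gpu:.2f}/GPU-hour "
          f"(~${equiv_2080ti:.2f} per 2080 Ti-equivalent hour)")
```

On a per-GPU basis the smaller providers come out at roughly a third to half the big clouds' price, which is the point being made about [5] and [6].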
Please point me to any resources that show how the content of the article doesn't apply anymore, 2 years later. I'd be very interested to learn what actually changed (if anything).
NVIDIA's new A100 platform might be a game changer, but it's not yet available in public cloud offerings.
[1] https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v...
[2] https://cloud.google.com/compute/gpus-pricing
[3] https://azure.microsoft.com/en-us/pricing/details/virtual-ma...
[4] https://aws.amazon.com/ec2/pricing/on-demand/
Also, because you cited our GPU benchmarks, I wanted to throw in a mention of our GPU instances, which have some of the lowest training costs on the Stanford DAWNBench benchmarks discussed in the article.
"For example, we recently internally benchmarked an Inferentia instance (inf1.2xlarge) against a GPU instance with an almost identical spot price (g4dn.xlarge) and found that, when serving the same ResNet50 model on Cortex, the Inferentia instance offered a more than 4x speedup."
https://towardsdatascience.com/why-every-company-will-have-m...
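Taking the quote at face value (almost identical spot price, ~4x throughput), the cost-per-inference arithmetic is straightforward. The hourly price and throughput numbers below are placeholders, not real AWS figures; only the ratio comes from the quote.

```python
# Back-of-envelope cost per inference from the quoted claim.
hourly_price = 0.20                 # hypothetical $/h for both instance types
gpu_throughput = 100.0              # hypothetical inferences/sec on g4dn.xlarge
inf_throughput = 4 * gpu_throughput # the claimed ~4x Inferentia speedup

def cost_per_million(price_per_hour, inferences_per_sec):
    # Dollars spent to serve one million inferences at steady state.
    return price_per_hour / (inferences_per_sec * 3600) * 1_000_000

gpu_cost = cost_per_million(hourly_price, gpu_throughput)
inf_cost = cost_per_million(hourly_price, inf_throughput)
print(gpu_cost / inf_cost)  # 4.0: same price, 4x throughput -> 4x cheaper
```

In other words, at equal hourly cost the speedup translates one-for-one into a lower cost per inference, regardless of what the absolute prices are.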
Yeah, it took 12-24 hours to do what I could log in to AWS and accomplish in minutes with parallel GPUs... but practical solutions were already within reach. The primary changes now are buzz and a possibly unprecedented rate of research progress.
A thought experiment: suppose we meet aliens who are remarkably similar to ourselves and have an IC industry. Would they be impressed by our Moore's law progress, or wonder why we took so long?
"Moore's prediction has been used in the semiconductor industry to guide long-term planning and to set targets for research and development."
I guess PlaidML might be a viable option?