However, I think the overall trend this article talks about is accurate. There has been an increased focus on cost-to-train, and you can see that with models like EfficientNet, where NAS is used to optimize accuracy and model size jointly.
We also seem to be moving more towards a world where big problem-specific models are shared (BERT, GPT), so that the base time to train doesn't matter much unless you're doing model architecture research. For most end-use cases in language and perception, you'll end up picking up a 99%-trained model and fine-tuning it on your particular version of the problem.
Training has become much more accessible, due to a variety of things (ASICs, offerings from public clouds, innovations on the data science side). Comparing it to Moore's Law doesn't make any sense to me, though.
Moore's Law is an observation on the pace of increase of a tightly scoped thing, the number of transistors.
The cost of training a model is not a single "thing," it's a cumulative effect of many things, including things as fluid as cloud pricing.
Completely possible that I'm missing something obvious, though.
I assume it's meant as a qualitative comparison rather than a meaningful quantitative one. Sort of a (sub-)cultural touchstone to illustrate a point about which phase of development we're in.
With CPUs, during the phase of consistent year after year exponential growth, there were ripple effects on software. For example, for a while it was cost-prohibitive to run HTTPS for everything, then CPUs got faster and it wasn't anymore. So during that phase, you expected all kinds of things to keep changing.
If deep learning is in a similar phase, then whatever the numbers are, we can expect other things to keep changing as a result.
The enabling tech was the AES-NI instruction set, not raw CPU speed.
Agreed on the rest. The main reason modern CPUs and GPUs all support 16-bit floats is probably the deep learning trend.
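To make the trade-off concrete, here's a small numpy illustration of what 16-bit floats give up. numpy's float16 is IEEE half precision, the same format the hardware implements.

```python
import numpy as np

# FP16 halves memory and bandwidth, which is why DL hardware embraced it,
# but it gives up both range and precision relative to FP32.
fp16 = np.finfo(np.float16)
print(fp16.max)  # 65504.0 -- large activations or gradients can overflow
print(fp16.eps)  # ~0.000977 -- the spacing between 1.0 and the next float

# Small updates can vanish entirely in FP16:
w = np.float16(1.0)
print(w + np.float16(1e-4) == w)  # True: the update rounds away
```

The vanishing-update problem is one reason training typically uses "mixed" precision (FP16 math with FP32 master weights) rather than pure FP16.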
What does “more transistors” mean? It means just what Gordon Moore meant when he said it: the opportunity for more function in the same space, at the same cost.
Laypeople and marketing grabbed the term and claimed it implied “faster,” which was then absurdly conflated with CPU clock speed (itself an important input, though hardly the only one, in determining the actual speed of a system).
The use here is of the “garbled analogy” sort, which is surely the dominant use today.
The biggest innovation I've seen is in the cloud: backplane I/O and memory are essential, and until a few years ago there weren't many cloud configurations suitable for massive amounts of I/O.
The number of transistors also doesn't depend on a single thing. It can be argued that many macro events have contributed since the '80s: the VC model for chipmakers in Silicon Valley, the rise of the internet, going fabless, the rise of mobile, and innovations in fabrication technology.
For instance: quality control, abnormality detection (e.g. in medicine), agriculture (lots of movement there right now), parts inspection, assembly inspection, sorting, and so on. There are more applications for this stuff than you might think at first glance. Essentially, if a toddler can do it and it's a job right now, that's a good target.
None of these is anything someone can run from their bedroom, because they have very high quality and regulatory requirements and require constant work outside of the actual AI training.
This is actually reflected in the margins of "AI" companies, which are significantly lower than those of traditional SaaS businesses and require significantly more manpower to deal with the long-tailed problems: the cases where the AI fails, which are what actually matter.
Do it for a couple of publicly available docs, then contact the org saying you offer 'archive digitization' so their data people can mine it for intelligence.
Most of the time and resources of 'Digital Transformation'/Data Science departments go to just manually extracting info from all kinds of old docs, PDFs, and spreadsheets containing institutional knowledge.
The opportunity exists for a decentralized network that allows models to be trained on datasets held at individual facilities.
Think of all the data sitting in silos from clinical trials. There is, of course, the painful process of authenticating researchers for access to data like that, but it can be done. There just needs to be an economic reason to make that kind of effort.
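One common shape for this kind of decentralized training is federated averaging: each facility trains locally on its private data, and only model weights cross the boundary. A minimal sketch with simulated silos (toy linear-regression data, honest participants assumed, no authentication or encryption):

```python
import numpy as np

rng = np.random.default_rng(1)

# Three "facilities" (e.g. trial sites), each holding private data that
# never leaves the silo; only model weights are exchanged.
true_w = np.array([2.0, -1.0, 0.5])
silos = []
for _ in range(3):
    X = rng.normal(size=(100, 3))
    y = X @ true_w + 0.01 * rng.normal(size=100)
    silos.append((X, y))

w_global = np.zeros(3)
for _ in range(20):                          # communication rounds
    local_weights = []
    for X, y in silos:
        w = w_global.copy()
        for _ in range(10):                  # local SGD on private data
            grad = X.T @ (X @ w - y) / len(y)
            w -= 0.1 * grad
        local_weights.append(w)
    # Server averages the locally trained weights -- the data never moves.
    w_global = np.mean(local_weights, axis=0)

print(np.round(w_global, 2))  # close to [2, -1, 0.5]
```

A real deployment would add the hard parts the comment alludes to: authenticating participants, and probably secure aggregation so the server never sees any single site's update.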
I got pulled in the direction of using ML to predict costs of care in insurance, so I didn't go further down the rabbit hole, but I did author a patent for a novel approach to exchanging data via decentralized identities.
If any of this sounds exciting to you feel free to email me. hn (at) strapr (dot) com
They pride themselves on this type of fundamental, bottom up analysis on the market.
It's fine. I don't know if I agree with comparing Moore's law, which is fundamentally about hardware, with the cost to run a "system" that is a combination of customized hardware and new software techniques.
Hackernews discussion for the article: https://news.ycombinator.com/item?id=18063893
It really is interesting how this is changing the dynamics of neural network training. Now it is affordable to train a useful network on the cloud, whereas 2 years ago that was reserved for companies with either bigger investments or an already consolidated product.
I honestly don't see how anything changed significantly in the past 2 years. Benchmarks indicate that a V100 is barely 2x the performance of an RTX 2080 Ti [1], and a V100 is:
• $2.50/h at Google [2]
• $13.46/h (4xV100) at Microsoft Azure [3]
• $12.24/h (4xV100) at AWS [4]
• ~$2.80/h (2xV100, 1 month) at LeaderGPU [5]
• ~$3.38/h (4xV100, 1 month) at Exoscale [6]
Other, smaller cloud providers are in a similar price range to [5] and [6] (read: GCE, Azure, and AWS are way overpriced...).
Using the 2x figure from [1] and adjusting the build price to a 2080 Ti and an AMD Ryzen 9 3950X instead of the Threadripper results in figures similar to those in the article you provided.
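Working the per-GPU arithmetic out from the list above (prices as quoted in the comment; the ~2x V100-vs-2080 Ti performance figure is from [1]):

```python
# Hourly price and GPU count per instance, from the figures above.
prices = {
    "Google GCE": (2.50, 1),
    "Azure":      (13.46, 4),
    "AWS":        (12.24, 4),
    "LeaderGPU":  (2.80, 2),
    "Exoscale":   (3.38, 4),
}

for provider, (hourly, n_gpus) in prices.items():
    per_gpu = hourly / n_gpus
    # With the ~2x figure from [1], one V100 hour buys roughly two
    # 2080 Ti hours of training throughput.
    equiv_2080ti = per_gpu / 2
    print(f"{provider:10s} ${per_gpu:.2f}/GPU-hour "
          f"(~${equiv_2080ti:.2f} per 2080 Ti-equivalent hour)")
```

On a per-GPU basis the smaller providers come out at roughly a third to half the big clouds' price, which is the point being made about [5] and [6].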
Please point me to any resources that show how the content of the article doesn't apply anymore, 2 years later. I'd be very interested to learn what actually changed (if anything).
NVIDIA's new A100 platform might be a game changer, but it's not yet available in public cloud offerings.
[1] https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v...
[2] https://cloud.google.com/compute/gpus-pricing
[3] https://azure.microsoft.com/en-us/pricing/details/virtual-ma...
[4] https://aws.amazon.com/ec2/pricing/on-demand/
Also, because you cited our GPU benchmarks, I wanted to throw in a mention of our GPU instances, which have some of the lowest training costs on the Stanford DAWNBench benchmarks discussed in the article.
"For example, we recently internally benchmarked an Inferentia instance (inf1.2xlarge) against a GPU instance with an almost identical spot price (g4dn.xlarge) and found that, when serving the same ResNet50 model on Cortex, the Inferentia instance offered a more than 4x speedup."
https://towardsdatascience.com/why-every-company-will-have-m...
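Taking the quote at face value (almost identical spot price, ~4x throughput), the cost-per-inference arithmetic is straightforward. The hourly price and throughput numbers below are placeholders, not real AWS figures; only the ratio comes from the quote.

```python
# Back-of-envelope cost per inference from the quoted claim.
hourly_price = 0.20                 # hypothetical $/h for both instance types
gpu_throughput = 100.0              # hypothetical inferences/sec on g4dn.xlarge
inf_throughput = 4 * gpu_throughput # the claimed ~4x Inferentia speedup

def cost_per_million(price_per_hour, inferences_per_sec):
    # Dollars spent to serve one million inferences at steady state.
    return price_per_hour / (inferences_per_sec * 3600) * 1_000_000

gpu_cost = cost_per_million(hourly_price, gpu_throughput)
inf_cost = cost_per_million(hourly_price, inf_throughput)
print(gpu_cost / inf_cost)  # 4.0: same price, 4x throughput -> 4x cheaper
```

In other words, at equal hourly cost the speedup translates one-for-one into a lower cost per inference, regardless of what the absolute prices are.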
Yeah, it took 12-24 hours to do what I could log in to AWS and accomplish in minutes with parallel GPUs... but practical solutions were already within reach. The primary changes now are buzz and a possibly unprecedented rate of research progress.
A thought experiment: suppose we meet aliens who are remarkably similar to ourselves and have an IC industry. Would they be impressed by our Moore's law progress, or wonder why we took so long?
"Moore's prediction has been used in the semiconductor industry to guide long-term planning and to set targets for research and development."
I guess PlaidML might be a viable option?