Machine Learning for Systems and Systems for Machine Learning [pdf]

(learningsys.org)

408 points by andrew3726 8y ago | 47 comments

cs702 8y ago

TPUs are only one part of this eye-opening presentation. Skip to page 28, where Jeff starts talking about:

* Using reinforcement learning so the computer can figure out how to parallelize code and models on its own. In experiments, the machine beats human-designed parallelization.

* Replacing B-tree indices, hash maps, and Bloom filters with data-driven indices learned by deep learning models. In experiments, the learned indices outperform the usual stalwarts by a large margin in both computing cost and performance, and are auto-tuning.

* Using reinforcement learning to manage datacenter power. Machine intelligence outperforms human-designed energy-management policies.

* Using machine intelligence to replace user-tunable performance options in all software systems, eliminating the need to tweak them with command line parameters like --num-threads=16, --max-memory-use=104876, etc. Machine intelligence outperforms hand-tuning.

* Using machine intelligence for all tasks currently managed with heuristics. For example, in compilers: instruction scheduling, register allocation, loop nest parallelization strategies, etc.; in networking: TCP window size decisions, backoff for retransmits, data compression, etc.; in operating systems: process scheduling, buffer cache insertion/replacement, file system prefetching, etc.; in job scheduling systems: which tasks/VMs to co-locate on same machine, which tasks to pre-empt, etc.; in ASIC design: physical circuit layout, test case selection, etc. Machine intelligence outperforms human heuristics.

IN SHORT: machine intelligence (today, that means deep learning and reinforcement learning) is going to penetrate and ultimately control EVERY layer of the software stack, replacing human engineering with auto-tuning, self-improving, better-performing code.

Eye-opening.
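
To make the bullet about user-tunable options concrete, here is a minimal sketch. It assumes a much simpler technique than the reinforcement learning described in the slides: an epsilon-greedy bandit that picks a thread count by measuring latency. The `simulated_latency` function, its optimum at 8 threads, and the candidate list are all hypothetical stand-ins for a real workload.

```python
import random


def simulated_latency(num_threads: int, rng: random.Random) -> float:
    """Hypothetical workload: latency is lowest at 8 threads, plus noise."""
    return (num_threads - 8) ** 2 + 10.0 + rng.gauss(0.0, 1.0)


def tune(candidates, trials=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit over candidate settings; returns the best one."""
    rng = random.Random(seed)
    totals = {c: 0.0 for c in candidates}
    counts = {c: 0 for c in candidates}
    for _ in range(trials):
        untried = [c for c in candidates if counts[c] == 0]
        if untried:
            choice = untried[0]                 # try every setting once
        elif rng.random() < epsilon:
            choice = rng.choice(candidates)     # explore
        else:
            # exploit: the setting with the lowest mean latency so far
            choice = min(candidates, key=lambda c: totals[c] / counts[c])
        totals[choice] += simulated_latency(choice, rng)
        counts[choice] += 1
    return min(candidates, key=lambda c: totals[c] / counts[c])


best = tune([1, 2, 4, 8, 16, 32])
print(best)
```

A real system would replace `simulated_latency` with live measurements and would have to cope with workloads that drift over time, which this sketch ignores.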

candiodari 8y ago

Ok, I can understand how a bloom filter can be replaced by a neural network predictive model. You could actually train it while stuff gets added. This would make adding somewhat more expensive, but ...

Ah so it appears they're advocating using neural networks as index functions to sorted arrays (hashmaps are simply sorted by hash instead of by something in the data).

So what they do is take a FIXED, already sorted set of data that you want to look things up in quickly, and train a model on it. One architecture is 2 layers of width 32 with ReLU activation, but they also train sequences of models. The error metric changes a lot: because the costs of the max and min error are huge, you minimize max error rather than average error.

They have the following brilliant insight: an index over a database (which gives the position of the data given the search key) is a CDF (cumulative distribution function)! That's brilliant! Of course it is!

And of course, this is Google. Once you have an index trained (which is a linear operation), you can translate the neural network model directly into C++, and compile it into machine instructions that don't depend on anything like TensorFlow libraries. The resulting code can be pasted into anything you want. This may work fast, but seems less than entirely practical ... although I guess you could do the same in Java far more easily and just include that code.

Paper here: http://learningsys.org/nips17/assets/slides/dean-nips17.pdf
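
The idea in this comment can be sketched in a few lines. This is an illustration under loud assumptions, not the paper's implementation: a single least-squares line stands in for the neural network, and the class name `LearnedIndex` and its methods are made up for the example. What it keeps from the description is the fixed sorted array, the model that predicts a position, and the stored min/max error that bounds the final search.

```python
import bisect
import random


def fit_line(keys):
    """Least-squares fit of position ~ slope * key + intercept.

    Assumes at least two distinct keys (otherwise the variance is zero).
    """
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var
    return slope, mean_p - slope * mean_k


class LearnedIndex:
    """Hypothetical learned index over a fixed, sorted list of keys."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        self.slope, self.intercept = fit_line(sorted_keys)
        # Record the worst under- and over-prediction seen on the data.
        errors = [i - self._predict(k) for i, k in enumerate(sorted_keys)]
        self.min_err, self.max_err = min(errors), max(errors)

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        guess = self._predict(key)
        lo = max(0, guess + self.min_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Binary search only inside the window the error bounds allow.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None


keys = sorted(random.Random(0).sample(range(1_000_000), 10_000))
index = LearnedIndex(keys)
position = index.lookup(keys[1234])
```

The search window has width `max_err - min_err + 1`, which is why a small maximum error matters more here than a small average error.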

frankmcsherry 8y ago

In case anyone wants to check out some pre-history, back in 2002 Manfred Warmuth et al.[0] were using learning (Weighted Majority) to drive systems components like cache replacement policy. I'm not sure where the work went from there, but add it to the pile of techniques.

[0]: https://users.soe.ucsc.edu/~sbrandt/papers/NIPS02.pdf
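
For readers who have not met Weighted Majority, here is a minimal sketch of the generic algorithm. It is not the cache-replacement system from the linked paper: the "experts" below are hypothetical fixed predictors (think of each one guessing whether a block will be reused), and the data is a toy example.

```python
def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Run Weighted Majority; return (number of mistakes, final weights).

    expert_predictions[t][i] is expert i's 0/1 prediction in round t.
    Experts that were wrong have their weight multiplied by beta.
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        vote_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_0 = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_1 >= vote_0 else 0    # ties go to 1
        if guess != outcome:
            mistakes += 1
        weights = [w * beta if p != outcome else w
                   for w, p in zip(weights, preds)]
    return mistakes, weights


# Two experts always predict 1; the third always predicts 0.
rounds = [[1, 1, 0]] * 4
outcomes = [0, 0, 0, 0]
mistakes, weights = weighted_majority(rounds, outcomes)
print(mistakes, weights)  # 2 [0.0625, 0.0625, 1.0]
```

After two mistakes the two bad experts have lost enough weight that the good one outvotes them.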

eternalban 8y ago

Thanks for the link. Very interesting. I found this [1] from 2015.

Reading your cite, the practical issue seems to me to be that the optimizer's memory footprint costs may in fact negate any benefit (e.g. ~40% over LRU) obtained in reducing cache misses.

My gut feeling is that this approach (for online systems) may work best with a hardware component (a card hosting the 'experts' and their virtual model e.g. the "virtual cache"). The distributed variant also seems worth exploring.

[1]: https://arxiv.org/pdf/1403.0388.pdf

jamesblonde 8y ago

Good summary. Some systems groups are already going in this direction. Peloton is trying to use DL to build a self-tuning DB: https://github.com/cmu-db/peloton. We have been trying to self-tune resource management decisions in Hadoop YARN using deep learning.

31reasons 8y ago

So basically it will replace all heuristics/greedy optimization algorithms. I am wondering if ML can come up with better sorting algorithms, or I guess when you can use ML as the end strategy for optimization, you don't have to sort!

killjoywashere 8y ago

I think the genomics folks have been onboard with this for a couple years now.

est 8y ago

I remember there was a joke that in Google's code base, there are more Bayesian cases than if...else...

pramodliv1 8y ago

It was a quote in Joel Spolsky's blog.

> A very senior Microsoft developer who moved to Google told me that Google works and thinks at a higher level of abstraction than Microsoft. “Google uses Bayesian filtering the way Microsoft uses the if statement,” he said.

https://www.joelonsoftware.com/2005/10/17/news-121/

laythea 8y ago

"replacing human engineering"

Good summary, but someone still has to write the machine intelligence!

p1esk 8y ago

Great comment. Fits most AGI discussions.

jacksmith21006 8y ago

Great summary. Thanks and agree this is a big deal.

cobookman 8y ago

Nvidia Titan V can do 110 TFLOPS, with 12GB of 1.7 Gb/s memory [1], and sells for $3,000. TPU v2 does 180 TFLOPS, with 64GB of 19.2 Gb/s memory [2].

That's a heck of a performance boost for a chip that's likely costing Google way less than the Nvidia flagship.

[1] http://www.tomshardware.com/news/nvidia-titan-v-110-teraflop...

nl 8y ago

It'd be really interesting to know the per-unit math on that.

Designing and taping out a new ASIC isn't cheap.

Presumably Google needs to use a fairly recent process (22nm or better?), which means GlobalFoundries, TSMC, or Samsung (do any of the Chinese native fabs have 22nm yet?). I wonder who is building them?

So many questions...

argonaut 8y ago

The TFLOPS numbers are not directly comparable. The TPUs use reduced precision in some areas, whereas I am guessing the Titan V numbers are based on single precision operations.

modeless 8y ago

Titan V numbers are also reduced precision (16 bits), using their tensor cores.

cobookman 8y ago

If all you need is reduced precision, then it’s fair to compare the two. I’d also assume memory bandwidth matters just as much as TFLOPS for ML workloads.

shaklee3 8y ago

It's not clear to me how programmable the TPU is. I'm sure it's great at convolutions and matrix multiplies. Can it do anything else?

jlebar 8y ago

Without speaking to the capabilities of TPUs, note that most ML models today are mostly convolutions and matrix multiplies.

EvgeniyZh 8y ago

Neither do tensor cores

PeterisP 8y ago

What else should it be doing?

It's an accelerator to run TensorFlow graphs, and TF graphs are essentially converted to matrix operations and convolutions.

jamesblonde 8y ago

Great talk, with lots of new insights into what's happening at Google. I really think his point that ImageNet is the new MNIST now holds true. Even research labs should be buying DeepLearning11 servers (10 x 1080Ti) for $15k and training large models in a reasonable amount of time. It may seem that Google is way ahead, but they are just doing synchronous SGD, and it was interesting to see the drop in prediction accuracy from 128 TPU2 cores to 256 TPU2 cores for ImageNet (76% -> 75% accuracy). So the algorithms for distributed training aren't unknown, and with cheap hardware like the DL11 server, many well-financed research groups can compete with this.

eggie5 8y ago

Ballpark, how much would it cost to train ImageNet (ILSVRC) on a standard deep CNN architecture (VGG or Inception) on AWS using a p2 or p3?

jamesblonde 8y ago

Ballpark: about 1100 dollars on AWS. Training takes 44 hr 28 min (from DAWNBench: http://dawn.cs.stanford.edu/benchmark/) on a DGX-1, priced as a p3.16xlarge at 24.48 dollars/hour: https://aws.amazon.com/ec2/pricing/on-demand/

On a DL11 server, it will take about 60 hrs, and only cost you 15k upfront. The economics speak for themselves for fp32 training, at this moment in time.
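
The arithmetic behind those numbers, as a small worked example. It uses only the figures quoted in this comment and deliberately ignores power, hosting, and the longer 60-hour runtime on the DL11 server, so the break-even count is a rough assumption-laden estimate rather than a real cost model.

```python
# Figures quoted in the comment above.
hours = 44 + 28 / 60          # 44 hr 28 min
rate = 24.48                  # dollars per hour for a p3.16xlarge
upfront = 15_000              # dollars for a DL11 server

cloud_cost = hours * rate                 # cost of one training run on AWS
break_even_runs = upfront / cloud_cost    # runs before the server pays off

print(f"one run on AWS: ${cloud_cost:.2f}")
print(f"break-even after about {break_even_runs:.1f} runs")
```

One run comes to roughly $1,089, which matches the "about 1100 dollars" ballpark, and the server's purchase price equals somewhere between 13 and 14 such runs.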

larelli 8y ago

It looks like this paper has more information: https://arxiv.org/pdf/1712.01208v1.pdf

EvgeniyZh 8y ago

Was it filmed? If yes, when will the video be available?

swah 8y ago

Yep - not very useful without the video.

laythea 8y ago

But we have the HN comments section!

nickpsecurity 8y ago

Great presentation. As far as applications go, I already thought this might be useful in lightweight formal methods, to spot problems and suggest corrections for failures in Rust's borrow checker, separation logic on C programs, proof tactics, and static analysis tooling.

For the Rust example, a person might try to express a solution in the language that fails the borrow checker. If they can't understand why, they submit it to a system that attempts to spot where the problem is. The system might start with humans spotting it and restructuring the code to pass the borrow checker. Every instance of those will feed into a learning system that might eventually do that on its own. There's also potential to use automated equivalence checks/tests between user-submitted code and the AI's suggestions to help a human in the loop decide if it's worth review before passing it on to the other person.

In hardware, both digital and analog designers seem to use lots of heuristics in how they design things. Certainly could help there. Might be especially useful in analog due to small number of experienced engineers available.

yeukhon 8y ago

While this is a collective work, honestly, after hearing about JD for so many years: is there anything he CAN’T do?

justicezyx 8y ago

He did little for the TPU.

1024core 8y ago

This is some really cool stuff, I hope this submission gets more upvotes and reaches a wider audience.

novaRom 8y ago

I speculate that Google will sell TPUv2 for as little as 500 USD per PCIe card as early as 2018. Nvidia's Volta Tensor Cores are essentially the same: 32-bit accumulators and 16-bit multipliers. But GPUs are more general-purpose, which is not necessary for deep learning, since the most intensive operation is the dot product (y += w*x).
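
A small sketch of why the 32-bit accumulator matters for y += w*x. It emulates the rounding with Python's `struct` module (format `e` is IEEE 754 half precision, `f` is single precision); it models the arithmetic only and says nothing about how a TPU or a Tensor Core is actually built. The helper names are made up for the example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(ws, xs, round_acc):
    """y += w * x with fp16 inputs; round_acc sets the accumulator width."""
    y = 0.0
    for w, x in zip(ws, xs):
        product = to_fp16(w) * to_fp16(x)   # 16-bit multiplier inputs
        y = round_acc(y + product)          # accumulator rounding
    return y


ws = [1.0] * 4096
xs = [1.0] * 4096
y32 = dot(ws, xs, to_fp32)   # 32-bit accumulator
y16 = dot(ws, xs, to_fp16)   # 16-bit accumulator
print(y32, y16)  # 4096.0 2048.0
```

With a 16-bit accumulator the sum stalls at 2048, because half precision cannot represent 2049 and the addition rounds back down; the 32-bit accumulator reaches the exact answer.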

quadrature 8y ago

I feel like the cloud play would be much stronger than entering the hardware market.

nl 8y ago

That "Learned Index Structures" paper makes it pretty clear that Karpathy was right in his widely criticized "Software 2.0" piece.

ekr 8y ago

I haven't read that paper (Learned Index Structures), but things like gperf have existed for decades. Are these enhanced data structures dynamic, i.e. unlike gperf, which is static, do they reoptimize as you insert new elements?

In the case of the hash table, I assume it's using the model to compute the hash function.

sanxiyn 8y ago

No, it doesn't handle inserts. On the other hand, the paper says:

"An ... approach to handling inserts is to build a delta-index. All inserts are kept in buffer and from time to time merged with a potential retraining of the model."
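
A minimal sketch of the delta-index idea in that quote, under stated assumptions: the class name `DeltaIndex`, the `merge_threshold` parameter, and the merge policy are hypothetical, and a plain binary search stands in for the learned model over the main array.

```python
import bisect


class DeltaIndex:
    """Hypothetical delta-index: a static main array plus an insert buffer."""

    def __init__(self, sorted_keys, merge_threshold=4):
        self.main = list(sorted_keys)    # what a learned model would index
        self.buffer = []                 # recent inserts, kept sorted
        self.merge_threshold = merge_threshold
        self.rebuilds = 0

    def insert(self, key):
        bisect.insort(self.buffer, key)
        if len(self.buffer) >= self.merge_threshold:
            self._merge()

    def _merge(self):
        self.main = sorted(self.main + self.buffer)
        self.buffer = []
        self.rebuilds += 1   # a learned index would retrain its model here

    def contains(self, key):
        # A lookup has to consult both the main array and the buffer.
        return (self._in_sorted(self.main, key)
                or self._in_sorted(self.buffer, key))

    @staticmethod
    def _in_sorted(arr, key):
        i = bisect.bisect_left(arr, key)
        return i < len(arr) and arr[i] == key


idx = DeltaIndex([10, 20, 30])
for k in (5, 25, 15):
    idx.insert(k)            # still buffered, no rebuild yet
found_before_merge = idx.contains(25)
idx.insert(35)               # fourth insert triggers the merge
```

The trade-off is that every lookup pays for a second search in the buffer until the next merge, in exchange for not retraining on every insert.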

nl 8y ago

This is not like gperf.

Nydhal 8y ago

I thought it was a stretch when reading that Medium post. Now, reading this and thinking about it after finishing two undergraduate classes, one on operating systems and another on compilers, the machine learning for systems part makes a lot of sense. Apart from the heuristics, the learned index structures idea is just fascinating.

Another illuminating sentence from the paper was this:

> This leads to an interesting observation: a model which predicts the position given a key inside a sorted array effectively approximates the cumulative distribution function (CDF). We can model the CDF of the data to predict the position as: p = F(Key) ∗ N

Maybe it's just my very limited knowledge as an undergrad, but I feel that this could be the start of something big. Another idea that came to me afterwards is how much of this ML is applicable to the domain of cryptography. In my security class it seemed like many of the famous hash functions, for example, were somehow "found" in a vast space of potential schemes.
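
The quoted formula p = F(Key) ∗ N can be checked numerically. This sketch assumes the keys come from a known distribution (exponential with rate 1, chosen only for illustration), so the exact CDF is available in closed form and no model has to be trained; the paper instead learns an approximation of F from the data.

```python
import math
import random

rng = random.Random(0)
rate = 1.0
N = 10_000
keys = sorted(rng.expovariate(rate) for _ in range(N))


def cdf(key):
    """Analytic CDF of the exponential distribution the keys came from."""
    return 1.0 - math.exp(-rate * key)


def predicted_position(key):
    return cdf(key) * N      # p = F(Key) * N


# Largest gap between the predicted and the true position in the array.
max_error = max(abs(predicted_position(k) - i) for i, k in enumerate(keys))
print(f"max position error: {max_error:.0f} out of {N} slots")
```

The prediction lands within a small fraction of the array of the true position, so only a short local search is needed to finish the lookup.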

oh-kumudo 8y ago

I think you got the point though. This is going to be HUGE. But why the downvotes? Job security concern?

Nydhal 8y ago

It could be. CS and, more generally, tech-related fields have this positive feedback loop where the technology facilitates its own development.

For example:

The easiest and most abundant thing to learn on the web is unsurprisingly web development.

An ML engineer can use ML to optimize the data structures that he uses for his models.

I could not say the same about fields like biology or physics.

nl 8y ago

Judging by the downvotes some people really don't like that idea.
