The implementations and benchmarks can be found here if anyone is curious: https://gitlab.com/VladLazar/cache-oblivious-data-structures
They are a great starting point and work well for the RAM/Swap level of the memory hierarchy, but for the L1/L2 caches manual tuning still pays off. This unfortunately makes the implementation non-oblivious and even dependent on the specific CPU used.
I had some code that was slower than I intuitively expected it to be. I looked around for approaches and came across cache oblivious algorithms. In my case I was thinking of RAM as a cache: some of the inputs fit in RAM and some were larger than RAM. All inputs and outputs were on network drives. The final output was 4.5 GB. Using cache oblivious algorithms as inspiration I got my runtime down from 5 minutes to 20 seconds (numbers are approximate, as this was nearly ten years ago now).
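The original workload isn't described, but the core trick behind cache-oblivious algorithms — recurse until subproblems fit in *every* level of the hierarchy (L1, RAM, network storage), without hard-coding any level's size — can be sketched with the classic cache-oblivious matrix transpose. This is my own illustration, not the commenter's code:

```python
# Sketch of the cache-oblivious pattern: split the larger dimension
# recursively, so at some recursion depth each block fits in cache
# (whichever cache), with no cache size appearing in the code.
def transpose(a, b, r0, r1, c0, c1):
    """Transpose the block a[r0:r1][c0:c1] into b (lists of lists)."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 8 and cols <= 8:
        # Base case: block is small enough for any plausible cache.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        mid = (r0 + r1) // 2          # split the longer dimension
        transpose(a, b, r0, mid, c0, c1)
        transpose(a, b, mid, r1, c0, c1)
    else:
        mid = (c0 + c1) // 2
        transpose(a, b, r0, r1, c0, mid)
        transpose(a, b, r0, r1, mid, c1)

n = 16
a = [[i * n + j for j in range(n)] for i in range(n)]
b = [[0] * n for _ in range(n)]
transpose(a, b, 0, n, 0, n)
```

The same shape of recursion works whether the "cache" is L1 or RAM backed by a network drive, which is presumably why it transferred to the larger-than-RAM case.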
I was quite pleased.
I guess you'd miss out on inlining / constant-folding opportunities, if you didn't have a JIT.
https://www.gnu.org/software/libc/manual/html_node/Constants...
However, consider a scenario like a database, where you might write to disk once and don't get cheap opportunities to reorganize the data when the memory or processor hardware is swapped out.
Or when your data is sent over the network for processing by some arbitrary client hardware.
> For convenience, suppose the binary tree is complete and has height H=2^K.
What is K? It's never stated. I'd usually assume H = log N, if N is the number of nodes in a balanced tree.
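One plausible reading: the height is forced to be a power of two so that the recursive halving in the van Emde Boas layout stays exact at every level, with K being the number of halvings. A sketch of that layout (my own illustration, assuming 1-based implicit heap indexing where node i has children 2i and 2i+1):

```python
def veb_order(root, height):
    """Return the nodes of a complete binary tree in van Emde Boas order.

    height counts levels; when it is a power of two, height // 2 splits
    the tree exactly in half at every recursion step (for odd heights
    you would have to pick ceil or floor).
    """
    if height == 1:
        return [root]
    top_h = height // 2
    bot_h = height - top_h
    order = veb_order(root, top_h)              # lay out the top subtree first
    # Leaves of the top subtree, left to right: relative depth top_h - 1.
    leaves = [root * (1 << (top_h - 1)) + k for k in range(1 << (top_h - 1))]
    for leaf in leaves:                         # then each bottom subtree
        for child in (2 * leaf, 2 * leaf + 1):
            order += veb_order(child, bot_h)
    return order

# Height-4 tree (15 nodes): top half first, then the four bottom triangles.
layout = veb_order(1, 4)
```

With this ordering, any root-to-leaf walk touches O(log_B N) blocks for an unknown block size B, which is the point of the layout.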
However, avoiding details as much as is feasible is very powerful and worth keeping in mind. I feel like simple approaches such as streaming computations (not necessarily cache oblivious) share many of the same underlying principles.
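As a concrete instance of that principle, a one-pass streaming computation — here Welford's online mean/variance, a standard algorithm, used purely as an example — runs in constant memory no matter how large the input is, so nothing in the code depends on how much of the data fits in any cache level:

```python
def running_stats(stream):
    """One-pass (streaming) count, mean, and population variance
    via Welford's algorithm: O(1) memory for any input size."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count            # update running mean
        m2 += delta * (x - mean)         # accumulate squared deviations
    variance = m2 / count if count else 0.0
    return count, mean, variance

# Works identically on a list, a file, or a generator too big for RAM.
count, mean, var = running_stats(iter(range(1, 6)))
```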
LPDDR4 is a totally different protocol, however. Maybe 32 bytes is optimal on cell phones... I don't know much about that.