Are there any directives to the Operating System to say - “here keep this data in the fastest accessible L[1,2,3] please”?
I'm probably the worst person to explain this.
Long long ago, I took a parallel programming class in grad school.
It turns out the conventional way to do matrix multiplication results in plenty of cache misses.
However, if you carefully tweak the order of the loops and do certain minor modifications — I forget the details — you could substantially increase the cache hits and make matrix multiplication go noticeably faster on benchmarks.
Some random details that may be relevant:
* When the processor loads a single number M[x][y], it also pulls in the adjacent numbers, because memory is fetched one cache line (typically 64 bytes) at a time. You need to take advantage of this.
* Whether the array is stored row-major or column-major is an important detail — it determines which adjacent elements share a cache line.
What I'm trying to say is, it is possible to indirectly optimize cache hits by careful hand-tweaking. I don't know if there's a general automagic way to do this though.
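If I remember roughly right, the loop-order trick looks something like this in C (a sketch for row-major arrays; the function names and size are mine):

```c
#include <string.h>

#define N 64

/* Naive i-j-k order: the innermost loop reads B[k][j] with k varying,
 * which strides down a column of B and wastes most of each cache line. */
void matmul_ijk(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Reordered i-k-j: the innermost loop now walks B and C along a row
 * (j varying), so consecutive iterations touch consecutive addresses.
 * Same result, far fewer cache misses on row-major arrays. */
void matmul_ikj(double A[N][N], double B[N][N], double C[N][N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}
```

Both versions add the same products in the same order per output element, so they produce identical results; only the memory access pattern changes.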
This probably wasn't very useful, but I'm just putting it out there. Maybe more knowledgeable folks can explain this better.
As you said, being aware of this lets you optimize away cache misses by controlling the memory access pattern.
I used that approach once on a batch job that read two multi-megabyte input files to produce a multi-gigabyte output file. It gave a massive speedup on a 32-bit Intel machine.
Not for general-purpose programs, because L1 cache designs change so much from one CPU generation to the next that there would be no point.
For embedded real-time processors, yes. For GPUs, yes. (OpenCL __local, CUDA __shared__).
This is because Microsoft's DirectX platform guarantees 32kB or something of __shared__ / tiled memory, so any GPU vendor that wants DirectX11 certification must provide that cache-like memory, and programmers can rely upon it. When DirectX12 or DirectX13 comes about, the new minimum specifications are published and all graphics programmers can then take advantage of them.
-------
No sane Linux/Windows programmer however would want these kinds of guarantees for normal CPU programs, outside of very strict realtime settings (at which point, you can rely upon the hardware being constant). Linux/Windows are designed as general purpose OSes.
DirectX 9 / 10 / 11 / 12 however, is willing to tie itself to the "GPUs of the time", and includes such specifications.
Some architectures (e.g. GPUs) provide local "scratchpad" memories instead of (or in addition to) caches. These are separate, uninitialized, addressable memory regions with access times similar to an L1/L2 cache.
For x86_64 there are cache hints, but no pinning/reserving of parts of the cache (as far as I know).
I wonder if Apple M1 or M2 cpu with unified CPU/GPU memory has anything like pinning or explicit cache control?
If the data is not contiguous it could make the CPU's life much harder.
There's also the matter of program size (the number of instructions in the actual program) and whether the program does anything that forces it to go to lower cache levels or RAM.
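To illustrate the contiguity point: summing the same values through a flat array versus through a linked chain of nodes gives the hardware a very different time (a toy sketch; the names are made up):

```c
#include <stddef.h>

struct node {
    double value;
    struct node *next;
};

/* Contiguous: the address stream is linear, so the hardware prefetcher
 * can stream cache lines ahead of the loop. */
double sum_array(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chasing: the next address isn't known until the current load
 * completes, so every cache miss stalls the whole chain. */
double sum_list(const struct node *head) {
    double s = 0.0;
    for (const struct node *p = head; p != NULL; p = p->next)
        s += p->value;
    return s;
}
```

Even when the list nodes happen to sit near each other in memory, the CPU can't know that in advance, which is what makes its life harder.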
There are intrinsics for software prefetching, such as _mm_prefetch, but those are difficult to use in a way that actually improves your performance.
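For reference, here is the same idea via GCC/Clang's __builtin_prefetch (a portable cousin of _mm_prefetch). The prefetch distance of 16 elements is an assumption that would need tuning per machine and workload:

```c
#include <stddef.h>

/* Sum an array while prefetching a fixed distance ahead.  If the distance
 * is too small the data hasn't arrived yet; too large and it may be
 * evicted before use -- which is why these hints are hard to get right. */
double sum_with_prefetch(const double *a, size_t n) {
    const size_t dist = 16;  /* assumed prefetch distance, tune per machine */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0 /* read */, 3 /* high locality */);
        s += a[i];
    }
    return s;
}
```

The result is identical to a plain loop; the hint only affects timing, and on a simple linear scan like this the hardware prefetcher usually already does the job, which is part of why manual prefetching so often fails to help.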