Are there any directives to the Operating System to say - “here keep this data in the fastest accessible L[1,2,3] please”?
I'm probably the worst person to explain this.
Long long ago, I took a parallel programming class in grad school.
It turns out the conventional way to do matrix multiplication results in plenty of cache misses.
However, if you carefully tweak the order of the loops and do certain minor modifications — I forget the details — you could substantially increase the cache hits and make matrix multiplication go noticeably faster on benchmarks.
Some random details that may be relevant:
* When the processor loads a single number M[x][y], it also pulls in the adjacent numbers, because memory is fetched one cache line (typically 64 bytes) at a time. You need to take advantage of this.
* Whether the array is stored row-major or column-major is an important detail — it determines which adjacent elements share a cache line.
What I'm trying to say is, it is possible to indirectly optimize cache hits by careful hand-tweaking. I don't know if there's a general automagic way to do this though.
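If I remember roughly right, the loop-order trick looks something like this in C (a sketch for row-major arrays; the function names and size are mine):

```c
#include <string.h>

#define N 64

/* Naive i-j-k order: the innermost loop reads B[k][j] with k varying,
 * which strides down a column of B and wastes most of each cache line. */
void matmul_ijk(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Reordered i-k-j: the innermost loop now walks B and C along a row
 * (j varying), so consecutive iterations touch consecutive addresses.
 * Same result, far fewer cache misses on row-major arrays. */
void matmul_ikj(double A[N][N], double B[N][N], double C[N][N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}
```

Both versions add the same products in the same order per output element, so they produce identical results; only the memory access pattern changes.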
This probably wasn't very useful, but I'm just putting it out there. Maybe more knowledgeable folks can explain this better.
As you said, being aware of this lets you optimize away cache misses by controlling the memory access pattern.
I used that approach once on a batch job that read two multi-megabyte input files to produce a multi-gigabyte output file. It gave a massive speedup on a 32-bit Intel machine.
Not for general-purpose programs, because L1 cache designs change so much from one CPU generation to the next that there would be no point.
For embedded real-time processors, yes. For GPUs, yes. (OpenCL __local, CUDA __shared__).
This is because Microsoft's DirectX platform guarantees 32kB or something of __shared__ / tiled memory, so any GPU vendor that wants DirectX11 certification must provide that cache-like memory, and programmers can rely upon it. When DirectX12 or DirectX13 comes about, the new minimum specifications are published and all graphics programmers can then take advantage of them.
-------
No sane Linux/Windows programmer however would want these kinds of guarantees for normal CPU programs, outside of very strict realtime settings (at which point, you can rely upon the hardware being constant). Linux/Windows are designed as general purpose OSes.
DirectX 9 / 10 / 11 / 12 however, is willing to tie itself to the "GPUs of the time", and includes such specifications.
Some architectures (e.g. GPUs) provide local "scratchpad" memories instead of (or in addition to) caches. These are separate, uninitialized, addressable memory regions with access times similar to an L1/L2 cache.
For x86_64 there are cache hints, but no pinning/reserving of parts of the cache (as far as I know).
I wonder if Apple M1 or M2 cpu with unified CPU/GPU memory has anything like pinning or explicit cache control?
If the data is not contiguous it could make the CPU's life much harder.
There's also the matter of program size (the number of instructions in the actual program) and whether the program does anything that forces it to go to lower cache levels or RAM.
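To illustrate the contiguity point: summing the same values through a flat array versus through a linked chain of nodes gives the hardware a very different time (a toy sketch; the names are made up):

```c
#include <stddef.h>

struct node {
    double value;
    struct node *next;
};

/* Contiguous: the address stream is linear, so the hardware prefetcher
 * can stream cache lines ahead of the loop. */
double sum_array(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chasing: the next address isn't known until the current load
 * completes, so every cache miss stalls the whole chain. */
double sum_list(const struct node *head) {
    double s = 0.0;
    for (const struct node *p = head; p != NULL; p = p->next)
        s += p->value;
    return s;
}
```

Even when the list nodes happen to sit near each other in memory, the CPU can't know that in advance, which is what makes its life harder.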
There are intrinsics for software prefetching, such as _mm_prefetch, but those are difficult to use in a way that actually improves your performance.
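For reference, here is the same idea via GCC/Clang's __builtin_prefetch (a portable cousin of _mm_prefetch). The prefetch distance of 16 elements is an assumption that would need tuning per machine and workload:

```c
#include <stddef.h>

/* Sum an array while prefetching a fixed distance ahead.  If the distance
 * is too small the data hasn't arrived yet; too large and it may be
 * evicted before use -- which is why these hints are hard to get right. */
double sum_with_prefetch(const double *a, size_t n) {
    const size_t dist = 16;  /* assumed prefetch distance, tune per machine */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0 /* read */, 3 /* high locality */);
        s += a[i];
    }
    return s;
}
```

The result is identical to a plain loop; the hint only affects timing, and on a simple linear scan like this the hardware prefetcher usually already does the job, which is part of why manual prefetching so often fails to help.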