There's `perf stat`, which uses CPU performance counters to give you a high-level view of whether your workload is stalling while waiting on memory: https://stackoverflow.com/questions/22165299/what-are-stalle.... However, it won't tell you exactly where the problem is, just that there is one. You can do `perf record` on your process and then run `perf report` on the generated data. You'll see which functions and which lines/instructions are taking the most time. Most of the time it will be pretty obvious that it's a memory bottleneck, because it will be some kind of assignment or lookup.
If you're using an Intel processor, VTune is extremely detailed. Here's a nice article from Intel on using it: https://www.intel.com/content/www/us/en/docs/vtune-profiler/... . You'll see that one of the tables in the article lists functions as "memory bound" - most of their time is spent waiting on memory, as opposed to executing computations.
We experimented with KeyDB too, but I'm not sure what state that project is in.
A single CPU core can actually execute more than one instruction at a time by leveraging out-of-order execution. The trick to exploiting it is avoiding local data dependencies. By swapping the iteration order, they allow the CPU core to start the next iteration before the previous one has finished. Why? Because there is no data dependency anymore!
I haven't profiled that code, but I'd guess the bottleneck would now be the sum. It doesn't matter much, though, since accessing a register is the fastest operation; accessing the memory cache is slower, and accessing RAM is slower still.
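To make the dependency point concrete, here's a minimal sketch (not the code from the article; function and variable names are made up). The single-accumulator loop forms one serial chain of adds, while splitting into two independent accumulators gives the out-of-order core two chains it can run in flight simultaneously:

```c
#include <stddef.h>

/* One accumulator: each add must wait for the result of the previous
 * add, so the loop is limited by the latency of that dependency chain. */
long sum_single(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];              /* depends on the previous iteration's s */
    return s;
}

/* Two accumulators: the adds into s0 and s1 have no dependency on each
 * other, so the core can execute them out of order / in parallel. */
long sum_split(const long *a, size_t n) {
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];             /* independent of the s1 chain */
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];             /* leftover element for odd n */
    return s0 + s1;
}
```

Both functions return the same total; the second just gives the scheduler more independent work per cycle.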
Algorithms and methodologies don't require being well-known or being well-maintained to be valid and useful comparisons.
The problem with linked lists is that node addresses aren't necessarily contiguous, because of malloc, and the next pointer could be NULL? Why does the interleaved loop make it faster for the CPU? It's still a linked list - arbitrary memory, could be NULL. Not sure what I'm missing here?
If there are two lists, in the first example, you're doing:
a1 -> b1 -> c1 -> d1
a2 -> b2 -> c2
In the 2nd example, you're doing:
[a1, a2]
[b1, b2]
[c1, c2]
[d1]
Visiting the same number of nodes, but because the nodes are referenced from an array, when you load `a1` you're [probably] also going to load `a2` into the cache.

Yep, the problem with a linked list is that you don't know the address of the next node; it could be allocated anywhere. However, if you have a contiguous array, when you read the first item you're also fetching the next items in the cache line.
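A hedged sketch of the two traversal orders described above (the Node type, list shapes, and function names are illustrative, not from the original code). In the sequential version every load depends on the previous one - the CPU can't fetch `n->next` until `n` has arrived from memory. In the interleaved version the loads for different lists are independent, so the core can have several in flight at once:

```c
#include <stddef.h>

typedef struct Node {
    int value;
    struct Node *next;
} Node;

/* First example: chase each list to the end before starting the next.
 * Every step is a dependent pointer load. */
long sum_sequential(Node *const lists[], int nlists) {
    long total = 0;
    for (int i = 0; i < nlists; i++)
        for (const Node *n = lists[i]; n != NULL; n = n->next)
            total += n->value;
    return total;
}

/* Second example: advance all lists one step per round, like the
 * [a1, a2], [b1, b2], ... order above. The per-list loads in each
 * round have no dependency on each other. Note: this sketch mutates
 * the caller's array of head pointers. */
long sum_interleaved(Node *lists[], int nlists) {
    long total = 0;
    int live = nlists;
    while (live > 0) {
        live = 0;
        for (int i = 0; i < nlists; i++) {
            if (lists[i] != NULL) {
                total += lists[i]->value;
                lists[i] = lists[i]->next;  /* independent per list */
                live++;
            }
        }
    }
    return total;
}
```

Both visit the same nodes and return the same total; the difference is purely in how much independent memory work the CPU sees at any moment.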