The implementations and benchmarks can be found here if anyone is curious: https://gitlab.com/VladLazar/cache-oblivious-data-structures
They are a great starting point and work well for the RAM/Swap level of the memory hierarchy, but for the L1/L2 caches manual tuning still pays off. This unfortunately makes the implementation non-oblivious and even dependent on the specific CPU used.
I had some code that was slower than I intuitively expected it to be. I looked around for approaches and came across cache oblivious algorithms. In my case I was thinking of RAM as a cache: some of the inputs fit in RAM and some were larger than RAM. All inputs and outputs were on network drives. The final output was 4.5 GB. Using cache oblivious algorithms as inspiration I got my runtime down from 5 minutes to 20 seconds (numbers are approximate, as this was nearly ten years ago now).
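The original workload isn't described, but the core trick behind cache-oblivious algorithms — recurse until subproblems fit in *every* level of the hierarchy (L1, RAM, network storage), without hard-coding any level's size — can be sketched with the classic cache-oblivious matrix transpose. This is my own illustration, not the commenter's code:

```python
# Sketch of the cache-oblivious pattern: split the larger dimension
# recursively, so at some recursion depth each block fits in cache
# (whichever cache), with no cache size appearing in the code.
def transpose(a, b, r0, r1, c0, c1):
    """Transpose the block a[r0:r1][c0:c1] into b (lists of lists)."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 8 and cols <= 8:
        # Base case: block is small enough for any plausible cache.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        mid = (r0 + r1) // 2          # split the longer dimension
        transpose(a, b, r0, mid, c0, c1)
        transpose(a, b, mid, r1, c0, c1)
    else:
        mid = (c0 + c1) // 2
        transpose(a, b, r0, r1, c0, mid)
        transpose(a, b, r0, r1, mid, c1)

n = 16
a = [[i * n + j for j in range(n)] for i in range(n)]
b = [[0] * n for _ in range(n)]
transpose(a, b, 0, n, 0, n)
```

The same shape of recursion works whether the "cache" is L1 or RAM backed by a network drive, which is presumably why it transferred to the larger-than-RAM case.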
I was quite pleased.
I guess you'd miss out on inlining / constant-folding opportunities, if you didn't have a JIT.
https://www.gnu.org/software/libc/manual/html_node/Constants...
However, consider a scenario like a database, where you might write to disk once and don't get cheap opportunities to reorganize the data when the memory or processor hardware is swapped out.
Or when your data is sent over the network for processing by some arbitrary client hardware.
> For convenience, suppose the binary tree is complete and has height H=2^K.
What is K? It's never stated. I'd usually assume H = log N, if N is the number of nodes in a balanced tree.
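One plausible reading: the height is forced to be a power of two so that the recursive halving in the van Emde Boas layout stays exact at every level, with K being the number of halvings. A sketch of that layout (my own illustration, assuming 1-based implicit heap indexing where node i has children 2i and 2i+1):

```python
def veb_order(root, height):
    """Return the nodes of a complete binary tree in van Emde Boas order.

    height counts levels; when it is a power of two, height // 2 splits
    the tree exactly in half at every recursion step (for odd heights
    you would have to pick ceil or floor).
    """
    if height == 1:
        return [root]
    top_h = height // 2
    bot_h = height - top_h
    order = veb_order(root, top_h)              # lay out the top subtree first
    # Leaves of the top subtree, left to right: relative depth top_h - 1.
    leaves = [root * (1 << (top_h - 1)) + k for k in range(1 << (top_h - 1))]
    for leaf in leaves:                         # then each bottom subtree
        for child in (2 * leaf, 2 * leaf + 1):
            order += veb_order(child, bot_h)
    return order

# Height-4 tree (15 nodes): top half first, then the four bottom triangles.
layout = veb_order(1, 4)
```

With this ordering, any root-to-leaf walk touches O(log_B N) blocks for an unknown block size B, which is the point of the layout.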
However, avoiding details as much as is feasible is very powerful and worth keeping in mind. I feel like simple approaches such as streaming computations (not necessarily cache oblivious) share many of the same underlying principles.
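As a concrete instance of that principle, a one-pass streaming computation — here Welford's online mean/variance, a standard algorithm, used purely as an example — runs in constant memory no matter how large the input is, so nothing in the code depends on how much of the data fits in any cache level:

```python
def running_stats(stream):
    """One-pass (streaming) count, mean, and population variance
    via Welford's algorithm: O(1) memory for any input size."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count            # update running mean
        m2 += delta * (x - mean)         # accumulate squared deviations
    variance = m2 / count if count else 0.0
    return count, mean, variance

# Works identically on a list, a file, or a generator too big for RAM.
count, mean, var = running_stats(iter(range(1, 6)))
```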
LPDDR4 is a totally different protocol, however. Maybe 32 bytes is optimal on cell phones... I don't know much about that.