This plotty tool [0] seems interesting and valuable - but I'm not sure how it relates to the problem the author talks about.
Would it be possible to determine ahead of time what order would maximize performance, or would that require profiling?
With regard to alignment: do linkers typically pack objects so tightly that the start of each object isn't aligned on a cache line boundary? AFAIK cache lines are typically 32, 64, or 128 bytes.
Probably, because cache line sizes are an implementation detail, not part of the architectural specification.
Maybe there are other causes as well.
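To make the alignment point concrete, here is a rough sketch of how the start address changes the number of cache lines a blob of code or data touches. It assumes a 64-byte line (common, but as noted above not guaranteed), and the function names are made up for illustration, not taken from any tool in the article.

```python
def cache_lines_touched(start: int, size: int, line_size: int = 64) -> int:
    """Count the cache lines covered by `size` bytes beginning at address `start`.

    `line_size` defaults to 64, which is an assumption; real hardware varies.
    """
    if size <= 0:
        return 0
    first = start // line_size               # index of the line holding the first byte
    last = (start + size - 1) // line_size   # index of the line holding the last byte
    return last - first + 1


def align_up(addr: int, alignment: int) -> int:
    """Round `addr` up to the next multiple of `alignment` (must be a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


if __name__ == "__main__":
    # A 64-byte object that starts on a line boundary fits in one line...
    print(cache_lines_touched(0x1000, 64))
    # ...but the same object shifted by 16 bytes straddles two lines.
    print(cache_lines_touched(0x1010, 64))
    # Padding the start address up to the next boundary brings it back to one line.
    print(cache_lines_touched(align_up(0x1010, 64), 64))
```

So if objects are only aligned to something smaller than the line size, whatever happens to be linked in front of them can shift how many lines they touch, which is one way link order could end up mattering.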
> Would it be possible to determine ahead of time what order would maximize performance, or would that require profiling?
I think at the very least, you'd need profiling to determine the hot code path, and that can change depending on input...
It probably explains why different runs vary so widely. I always thought it was other things going on in the OS; I never really thought about the caches, etc.