Now take the same architecture and spawn a hundred processes instead. Each process's Foo objects now live in different physical pages, so no two processes ever write to the same cache line, and each process's writes live happily in its own core's L1 cache.
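To make the process-per-worker idea concrete, here is a minimal sketch. It assumes a POSIX system where the `fork` start method is available, and `Foo`, `worker`, and `run_processes` are hypothetical names invented for this illustration. Python can't observe cache lines, so this only shows the memory isolation the argument relies on: each child writes to its own private copy of `Foo`, and the parent's copy is never touched.

```python
import multiprocessing as mp


class Foo:
    """Hypothetical stand-in for the per-worker object discussed above."""

    def __init__(self):
        self.counter = 0


def worker(foo, n_writes, conn):
    # After fork, `foo` is this process's private copy-on-write copy,
    # so these writes never touch memory another process is using.
    for _ in range(n_writes):
        foo.counter += 1
    conn.send(foo.counter)  # report the result explicitly, once, at the end
    conn.close()


def run_processes(n_procs=4, n_writes=10_000):
    ctx = mp.get_context("fork")  # assumption: POSIX system with fork
    foo = Foo()
    pipes, procs = [], []
    for _ in range(n_procs):
        # duplex=False gives (receive end, send end)
        parent_end, child_end = ctx.Pipe(duplex=False)
        p = ctx.Process(target=worker, args=(foo, n_writes, child_end))
        p.start()
        child_end.close()  # the parent keeps only the receiving end
        pipes.append(parent_end)
        procs.append(p)
    results = [conn.recv() for conn in pipes]
    for p in procs:
        p.join()
    # The parent's Foo is unchanged: no child ever wrote to it.
    return results, foo.counter


if __name__ == "__main__":
    print(run_processes())
```

Each child reports its full count and the parent's counter stays at 0. The only communication is the one explicit `send` at the end, which is the point: sharing becomes something you opt into rather than something that happens by accident.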
Obviously not all architectures work like this. If Foo is "really" shared, then nothing can help the contention. But usually it isn't; the code was just written by someone who didn't think about cache contention. That kind of performance bug is really easy to write. And a reasonable fix for it is "don't use multithreaded architectures when you don't need them".