what happens if those 16 instructions touch 16 different cache lines? I'm not an hardware expert (and even less on coherency protocols), but I think it would be extremely hard to make sure livelocking is avoided in all cases, short of having some extremely expensive and heavy handed global 'bus' lock fallback.