(I have no special insight into what Intel and AMD CPUs do under the hood, but my best guess has always been that they are implemented by ucode that has no dependencies on anything in the register file except whatever might be internal to the ucode for the instruction itself. And the dispatch logic will cheerfully schedule it as such, including moving it before loads that precede in the instruction stream. Since RDTSC itself isn’t a load, the magic that makes all loads be acquires does not apply. RDTSCP is probably an excessively heavily pessimized version that waits for earlier loads to actually happen. The really nice hypothetical version where RDTSC “loads” a virtual loadable register in the coherency domain and can be speculated just like a real load is probably too complex to be worth implementing.)