In the processor design I work on, we do register dependency checks by partitioning all instructions into a set of "timing classes" and checking the dispatch delay needed between dependent register producers and consumers across all possible timing-class pairs. The delays vary depending on available forwarding networks, resource conflicts, etc. Often we group instructions into suboptimal timing classes to simplify other parts of the design or just to make the dispatch logic simpler. Intel's x86 core is waaaaay more complicated than the core I work on and has far more instructions, so it's probably safe to say that they make these suboptimal classifications often. I strongly suspect that the false dependency was intentional and not a "hardware bug" as some of the StackOverflow comments seem to suggest.
We can only speculate, but it's likely that Intel has the same handling for a lot of two-operand instructions. Common instructions like add and sub take two operands, both of which are inputs. So Intel probably shoved popcnt into the same category to keep the processor design simple.
On the other hand, MOV doesn't read both operands either.
It would be interesting to see if the Intel C Compiler knows about this false dependency.
Specifically, the register allocator's handling of an instruction with a false dependency on a register that's written to, coupled with multiple compilers being unaware of the false dependency.
The problem with the compilers was that they were not aware of this behavior and thus generated suboptimal code for this situation ... but compiler writers are also mere humans.
When you distill a loop until you're finding the exact bottleneck in the system (pipelining, branch prediction, etc) you need to be very very careful you're measuring what you think you are. Otherwise you'll end up in this situation where you're benchmarking a compiler...
I ended up getting a noticeable speed boost just by using sync += (uint32_t)clocks * (uint64_t)frequency; ... just a simple 32-bit x 64-bit multiply was quite a bit faster than a 64-bit x 64-bit multiply. (One had to be 64-bit to prevent the multiplication from overflowing, as one value was in the MHz range and the other could be up to ~2000 or so.)
I've observed this on both AMD and Intel amd64 CPUs. Not sure how that'd hold up on other CPUs. As always though, profile your code first, and only consider these types of tricks in hot code areas.
I'd much rather have numbered registers that can be used for anything than named registers that have usage limitations.
Any hope to see a Thumb mode for x86-64?