Here are results for icpc 14.0.3 vs. g++ 4.8.1 on a Sandy Bridge E5-1620 @ 3.60GHz and a Haswell i7-4770 @ 3.40GHz:
nate@sandybridge:~/tmp$ g++ -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
nate@sandybridge:~/tmp$ popcnt-dependency 1
unsigned 41959360000 0.608615 sec 17.2289 GB/s
uint64_t 41959360000 0.82312 sec 12.739 GB/s
nate@sandybridge:~/tmp$ icpc -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
nate@sandybridge:~/tmp$ popcnt-dependency 1
unsigned 41959360000 0.182781 sec 57.3679 GB/s
uint64_t 41959360000 0.182638 sec 57.4128 GB/s
nate@haswell:~/tmp$ g++ -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
nate@haswell:~/tmp$ popcnt-dependency 1
unsigned 41959360000 0.401225 sec 26.1343 GB/s
uint64_t 41959360000 0.75841 sec 13.826 GB/s
nate@haswell:~/tmp$ icpc -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
nate@haswell:~/tmp$ popcnt-dependency 1
unsigned 41959360000 0.0843861 sec 124.259 GB/s
uint64_t 41959360000 0.0842836 sec 124.41 GB/s
The icpc numbers would be incredible if true! But I think it's a bug, since the inner loop in the icpc build looks far too short and doesn't seem to repeat the popcnt instructions. I'm not sure yet whether it's a problem with the compiler or whether the test case is relying on something undefined.
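One way to test the suspicion that the repeated popcnt passes are being collapsed is to put a compiler barrier between passes, so the optimizer has to assume the buffer may have changed and redo the work each time. This is only a sketch under assumptions: `popcount_passes` is a hypothetical helper, not code from popcnt-dependency.cpp, and it relies on GCC-style extended asm and the `__builtin_popcountll` builtin, which g++ provides and which I'd expect icpc on Linux to accept as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from popcnt-dependency.cpp): count the set bits
// in buf, repeating the full pass `passes` times.
std::uint64_t popcount_passes(const std::vector<std::uint64_t>& buf, int passes) {
    assert(passes >= 0);
    const std::uint64_t* p = buf.data();
    std::uint64_t total = 0;
    for (int pass = 0; pass < passes; ++pass) {
        // Empty asm statement (GCC extension). It emits no instructions, but
        // the "+r"(p) operand and the "memory" clobber tell the compiler that
        // p and the data it points to may have changed, so the result of the
        // previous pass cannot simply be reused or scaled.
        asm volatile("" : "+r"(p) : : "memory");
        for (std::size_t i = 0; i < buf.size(); ++i) {
            // GCC-compatible builtin; becomes a popcnt instruction when the
            // target supports it (e.g. with -march=native on these CPUs).
            total += static_cast<std::uint64_t>(__builtin_popcountll(p[i]));
        }
    }
    return total;
}
```

If a loop like this times in line with the g++ numbers while the original keeps the much higher figure, that would suggest the compiler is skipping repeated work rather than executing popcnt faster. It would not tell you whether that transformation is legitimate or a compiler bug.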