It has always been a bad idea to dispatch that naively and hand the work to exactly as many threads as you have cores. What if a couple of cores are busy, and you end up spending almost twice as long as you need to waiting for the calculation to finish? I don't know how much software does that, but most of it could easily be fixed to dispatch half a million rows at a time and get better performance on all computers.
Also, on current CPUs it'll be affected by hyperthreading and launch 28 threads, which would probably work out pretty well overall.
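
For what it's worth, here is a rough sketch of the fixed-chunk idea in Python. The names `chunk_ranges`, `process_rows` and `run_chunked` are made up for illustration, and `process_rows` is just a stand-in for whatever the real calculation is. The point is only the dispatch pattern: cut the rows into many chunks of a fixed size and let a pool of workers pull them as they become free, rather than cutting the rows into exactly one slice per core.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Chunk size from the comment above; an assumption to tune per workload.
CHUNK_ROWS = 500_000


def chunk_ranges(total_rows, chunk_rows=CHUNK_ROWS):
    """Yield (start, stop) row ranges of at most chunk_rows rows each."""
    for start in range(0, total_rows, chunk_rows):
        yield start, min(start + chunk_rows, total_rows)


def process_rows(start, stop):
    """Hypothetical stand-in for the real per-row calculation."""
    return sum(range(start, stop))


def run_chunked(total_rows, workers=None):
    """Process all rows in fixed-size chunks pulled by a worker pool."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # With many more chunks than workers, a worker stuck on a busy core
        # just completes fewer chunks instead of holding up the whole job.
        partials = pool.map(lambda r: process_rows(*r), chunk_ranges(total_rows))
        return sum(partials)


if __name__ == "__main__":
    print(run_chunked(3_000_000))
```

Note that in CPython, threads running pure Python code won't actually run in parallel because of the GIL, so treat this as a picture of the scheduling pattern only. The same shape applies to a process pool or to a native thread pool in another language.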