The author wants the best possible performance as long as its result is reproducible. I think this is evident from the fact that the non-SIMD code uses 4-wide vectors, so the author is definitely willing to trade accuracy if higher performance with a
reproducible result is possible.
> Re 4-8x -- the large option in xsum was benchmarked at less than 2x the cost of a direct sum. Not so bad?
I don't know where you did take that number, because xsum-paper.pdf clearly indicates larger performance difference. I'm specifically looking at the ratio between the minimum of any superaccumulator results and the minimum of any simple sum results, and among results I think relevant today (x86-64, no earlier than 2012), AMD Opteron 6348 is the only case where the actual difference is only about 1.5x and everything else hovers much higher.