undefined | Better HN

0 pointstitzer5y ago0 comments

It's a shame we can't see the rest of the code. What is happening to the result value? Is it being compared to something? Put into an array, or what? All of that code probably totally outweighs what you pointed out here. Or, at least it should. I have a bad feeling it might be being dead-code eliminated, since compilers are super aggressive about that nowadays, but I hope he's somehow controlled for that.

0 comments

brigade5y ago

The blog post links to the benchmark... It's repeatedly populating a 20k entry array with the results.

godbolt clang compiles it to:

    .LBB5_2:                                // =>This Inner Loop Header: Depth=1
        mul     x13, x11, x10
        umulh   x14, x11, x10
        eor     x13, x14, x13
        mul     x14, x13, x12
        umulh   x13, x13, x12
        eor     x13, x13, x14
        str     x13, [x0, x8, lsl #3]
        add     x8, x8, #2                      // =2
        cmp     x8, x1
        add     x11, x11, x9
        b.lo    .LBB5_2

[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

titzerOP5y ago

Thanks for the link.

Just staring at the machine code, it looks like the hottest loop for wyrng is about 10 instructions with a store in it. If the processor can do that loop in 1 cycle on average then...holy fuck.

edit: I was looking at similar code generated by clang on my machine. Again, holy fuck.

I don't think the story here is that 64x64=128 multiply is fast, honestly. The real story is the insane level of speculation and huge ROB that is necessary to make that non-unrolled loop go so fast. Everything has to go pretty much perfect to achieve that throughput.

pbsd5y ago

Based on the information from [1] we have something like this for both loops:

    .LBB0_2:
            eor     x13, x9, x9, lsr #30     # 2 \* p1-6
            mul     x13, x13, x11            # 1 \* p5-6
            eor     x13, x13, x13, lsr #27   # 2 \* p1-6
            mul     x13, x13, x12            # 1 \* p5-6
            eor     x13, x13, x13, lsr #31   # 2 \* p1-6
            str     x13, [x0, x10, lsl #3]   # 1 \* p7-8
            add     x13, x10, #2             # 1 \* p1-6
            add     x9, x9, x8               # 1 \* p1-6
            mov     x10, x13                 # none
            cmp     x13, x1                  #
            b.lo    .LBB0_2                  # Fused into 1 \* p1-3
                                             # Total: 11 uops

    .LBB1_2:
            mul     x13, x9, x11             # 1 \* p5-6
            umulh   x14, x9, x11             # 1 \* p5-6
            eor     x13, x14, x13            # 1 \* p1-6
            mul     x14, x13, x12            # 1 \* p5-6
            umulh   x13, x13, x12            # 1 \* p5-6
            eor     x13, x13, x14            # 1 \* p1-6
            str     x13, [x0, x10, lsl #3]   # 1 \* p7-8
            add     x13, x10, #2             # 1 \* p1-6
            add     x9, x9, x8               # 1 \* p1-6
            mov     x10, x13                 # none
            cmp     x13, x1                  #
            b.lo    .LBB1_2                  # Fused into 1 \* p1-3
                                             # Total: 10 uops

Purely based on number of uops, there's a slight win for wyhash, all other things being equal. However, I doubt that you're really getting one iteration per second here; there are 6 integer units, and even if you perfectly exploited instruction parallelism you're limited to 6 ALU instructions per cycle, which are less than the extent of either loop. It would be possible if the mul-umulh pairs are getting fused, which would bring it down to 8 uops per iteration.

Taking into account the port distribution, each iteration of wyhash involves 4 uops being dispatched to ports 5 and 6, which means you should be getting at least 2 cycles/iteration purely for the multiplications. If it's much lower than that, the whole multiplication being fused into a single port 5-6 uop might be right.

However I can neither confirm nor deny that the loops behave like that on the M1, as I don't have one.

[1] https://dougallj.github.io/applecpu/firestorm.html

1 more reply

dwaite5y ago

Perhaps there was a misunderstanding when executives talked to the silicon engineers about "the unbelievable speculation for our first in-house desktop-class CPU"?

gpderetta5y ago

I don't think a large rob is a significant contributor here.

The very wide execution is though.

Someone5y ago

I would be wary of using gettimeofday to measure such short periods. As https://pubs.opengroup.org/onlinepubs/009604599/functions/ge... says, the resolution of the system clock is unspecified, and 20,000 × ≈10 ≈ 200k instructions easily run in under a millisecond on modern hardware.

The benchmark probably get rids of that by doing it 40,000 times in quick succession, but why not measure the time of all 40,000 iterations in one go, and decrease the risk (and the overhead of calling gettimeofday 39,999 times)?

kllrnohj5y ago

The resolution of the clock is unspecified by the POSIX standard but that does not mean it's unspecified by the platform this is actually running on. If this was trying to be portable code you'd have a big issue there but it's not. It is still limited by gettimeofday maxing out at microsecond precision, though, which is quite poor. And seems to be using a realtime clock, which introduces errors from network time sync & such. That's unlikely to crop up here, but it's still a risk. clock_gettime_nsec_np(CLOCK_MONOTONIC) is what should be used here.

volta835y ago

Agree, you probably want to just read some cpu time stamp counter before and after, and make sure you ar pinning your threwads, locking the clocks, etc. so that you can get a reliable time from there, or... just use cycles as your unit of measure.

titzerOP5y ago

On Linux and MacOS, gettimeofday is accurate to microseconds. Not only is the returned struct is expressed in microseconds, but I have personally observed that it is accurate to microseconds.

j / k navigate · click thread line to collapse

0 comments

brigade5y ago

The blog post links to the benchmark... It's repeatedly populating a 20k entry array with the results.

godbolt clang compiles it to:

    .LBB5_2:                                // =>This Inner Loop Header: Depth=1
        mul     x13, x11, x10
        umulh   x14, x11, x10
        eor     x13, x14, x13
        mul     x14, x13, x12
        umulh   x13, x13, x12
        eor     x13, x13, x14
        str     x13, [x0, x8, lsl #3]
        add     x8, x8, #2                      // =2
        cmp     x8, x1
        add     x11, x11, x9
        b.lo    .LBB5_2

[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

titzerOP5y ago

Thanks for the link.

Just staring at the machine code, it looks like the hottest loop for wyrng is about 10 instructions with a store in it. If the processor can do that loop in 1 cycle on average then...holy fuck.

edit: I was looking at similar code generated by clang on my machine. Again, holy fuck.

pbsd5y ago

Based on the information from [1] we have something like this for both loops:

    .LBB0_2:
            eor     x13, x9, x9, lsr #30     # 2 \* p1-6
            mul     x13, x13, x11            # 1 \* p5-6
            eor     x13, x13, x13, lsr #27   # 2 \* p1-6
            mul     x13, x13, x12            # 1 \* p5-6
            eor     x13, x13, x13, lsr #31   # 2 \* p1-6
            str     x13, [x0, x10, lsl #3]   # 1 \* p7-8
            add     x13, x10, #2             # 1 \* p1-6
            add     x9, x9, x8               # 1 \* p1-6
            mov     x10, x13                 # none
            cmp     x13, x1                  #
            b.lo    .LBB0_2                  # Fused into 1 \* p1-3
                                             # Total: 11 uops

    .LBB1_2:
            mul     x13, x9, x11             # 1 \* p5-6
            umulh   x14, x9, x11             # 1 \* p5-6
            eor     x13, x14, x13            # 1 \* p1-6
            mul     x14, x13, x12            # 1 \* p5-6
            umulh   x13, x13, x12            # 1 \* p5-6
            eor     x13, x13, x14            # 1 \* p1-6
            str     x13, [x0, x10, lsl #3]   # 1 \* p7-8
            add     x13, x10, #2             # 1 \* p1-6
            add     x9, x9, x8               # 1 \* p1-6
            mov     x10, x13                 # none
            cmp     x13, x1                  #
            b.lo    .LBB1_2                  # Fused into 1 \* p1-3
                                             # Total: 10 uops

However I can neither confirm nor deny that the loops behave like that on the M1, as I don't have one.

[1] https://dougallj.github.io/applecpu/firestorm.html

1 more reply

dwaite5y ago

Perhaps there was a misunderstanding when executives talked to the silicon engineers about "the unbelievable speculation for our first in-house desktop-class CPU"?

gpderetta5y ago

I don't think a large rob is a significant contributor here.

The very wide execution is though.

Someone5y ago

kllrnohj5y ago

volta835y ago

titzerOP5y ago

On Linux and MacOS, gettimeofday is accurate to microseconds. Not only is the returned struct is expressed in microseconds, but I have personally observed that it is accurate to microseconds.

j / k navigate · click thread line to collapse