Xeamek · 2y ago

For everything? Obviously no.

But given all the crazy optimizations modern compilers do, I don't see how marking pointers for more than a couple of them in a row is that crazy.

Joker_vD · 2y ago

Because if you issue a bogus prefetch, you can't cancel it, can you? So that's 90 or so cycles during which the fetch for your actual data is delayed. Pointer chasing already strains memory bandwidth; requesting even more data from memory will only make things worse.
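For concreteness, here is roughly what an explicit prefetch hint on the next pointer looks like in C. This is only a sketch: it assumes GCC or Clang (`__builtin_prefetch` is their builtin), and the node layout and the name `sum_with_prefetch` are made up for illustration. Note that the address being hinted only becomes known once the current node has been loaded, so the hint goes out just a few instructions ahead of the real load of the same node.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout: payload first, then the link. */
struct node {
    long data;
    struct node *next;
};

/* Sums the payloads of a NULL-terminated list, hinting the next node. */
long sum_with_prefetch(const struct node *curr) {
    long sum = 0;
    while (curr != NULL) {
        /* Hint only: a prefetch does not fault, even when next is NULL. */
        __builtin_prefetch(curr->next);
        sum += curr->data;   /* the "work" done per node */
        curr = curr->next;   /* the real, dependent load */
    }
    return sum;
}
```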

And unrolling the loop for traversing a linked list can be done if you use a sentinel node instead of nullptr to signal the end:

        beqz    a0, .end
    .loop:
        ld      a1, 0(a0)   ; a1 = curr->data
        ld      a0, 8(a0)   ; curr = curr->next
        ; do something with payload in a1 here
        bnez    a0, .loop   ; keep going while curr != nullptr
    .end:

becomes

        la      s1, sentinel
        beq     a0, s1, .end
        ld      a1, 0(a0)
        ld      a2, 8(a0)
        ld      a3, 0(a2)
        ld      a4, 8(a2)
        ld      a5, 0(a4)
        ld      a6, 8(a4)
        beq     a6, s1, .trail
    .loop:
        ld      t0, 0(a6)
        ld      t1, 8(a6)
        ld      t2, 0(t1)
        ld      t3, 8(t1)
        ld      t4, 0(t3)
        ld      t5, 8(t3)
        ; do something with three payloads in a1, a3, a5 here
        mv      a1, t0
        mv      a2, t1
        mv      a3, t2
        mv      a4, t3
        mv      a5, t4
        mv      a6, t5
        bne     t5, s1, .loop
        ; do something with payload in a1 here (first node of the last, partial group)
        mv      a0, a2
        beq     a2, s1, .end
    .trail:
        ld      a1, 0(a0)
        ld      a0, 8(a0)
        ; do something with payload in a1 here
        bne     a0, s1, .trail
    .end:

As you can see, "ld t3, 8(t1)" comes almost right after "ld t1, 8(a6)", with only the load from 0(t1) in between, so a prefetch won't noticeably help here, and if the address that ends up in t3 is not in the cache, then "ld t5, 8(t3)" will stall no matter what. And moving the speculative loads up in the loop body, ahead of processing the payloads (using even more registers, as you can see), somewhat hurts the latency of processing the first three payloads.
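The same idea in C, as a rough sketch rather than a translation of the assembly: it checks for the end once per group of three instead of software-pipelining the loads, and a compiler is free to schedule it differently. The names `list_node`, `sentinel`, `sum_simple` and `sum_unrolled` are made up for illustration, and the sketch assumes the sentinel links to itself, which is what makes reading a couple of nodes past the end harmless.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: data at offset 0, next at offset 8 on a 64-bit target. */
struct list_node {
    int64_t data;
    struct list_node *next;
};

/* Assumption: the sentinel's next pointer is the sentinel itself. */
struct list_node sentinel = { 0, &sentinel };

/* Baseline: one node per iteration, one end-of-list check per node. */
int64_t sum_simple(const struct list_node *curr) {
    int64_t sum = 0;
    while (curr != &sentinel) {
        sum += curr->data;
        curr = curr->next;
    }
    return sum;
}

/* Unrolled by three: one end-of-list check per group of three nodes. */
int64_t sum_unrolled(const struct list_node *curr) {
    int64_t sum = 0;
    assert(sentinel.next == &sentinel);  /* the speculative walk relies on this */
    for (;;) {
        /* Walk two links ahead without checking; the self-linked sentinel
           guarantees that these are always valid loads. */
        const struct list_node *n2 = curr->next;
        const struct list_node *n3 = n2->next;
        if (n3 == &sentinel)
            break;  /* fewer than three real nodes left */
        sum += curr->data + n2->data + n3->data;
        curr = n3->next;
    }
    /* Trailing loop for the remaining zero, one or two nodes. */
    while (curr != &sentinel) {
        sum += curr->data;
        curr = curr->next;
    }
    return sum;
}
```

Each of the three pointer loads in a group still depends on the one before it, so unrolling removes branches but not the chain of dependent misses.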

Oh, and if you want to see something really crazy, look at e.g. splitting the branch instruction into prediction and resolution instructions [0].

[0] https://zilles.cs.illinois.edu/papers/branch_vanguard_isca_2...