But for all the crazy optimizations modern compilers do, I don't see how marking pointers for more than a couple of them in a row is that crazy.
And unrolling a loop that traverses a linked list can be done if you use a sentinel node instead of nullptr to signal the end:
beqz a0, .end
.loop:
ld a1, 0(a0) ; a1 = curr->data
ld a0, 8(a0) ; curr = curr->next
; do something with payload in a1 here
bnez a0, .loop
.end:
becomes:
la s1, sentinel
beq a0, s1, .end
ld a1, 0(a0)
ld a2, 8(a0)
ld a3, 0(a2)
ld a4, 8(a2)
ld a5, 0(a4)
ld a6, 8(a4)
beq a6, s1, .trail
.loop:
ld t0, 0(a6)
ld t1, 8(a6)
ld t2, 0(t1)
ld t3, 8(t1)
ld t4, 0(t3)
ld t5, 8(t3)
; do something with three payloads in a1, a3, a5 here
mv a1, t0
mv a2, t1
mv a3, t2
mv a4, t3
mv a5, t4
mv a6, t5
bne t5, s1, .loop
; do something with payload in a1 here
mv a0, a2
beq a2, s1, .end
.trail:
ld a1, 0(a0)
ld a0, 8(a0)
; do something with payload in a1 here
bne a0, s1, .trail
.end:
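For anyone who finds C easier to follow, here is a rough sketch of the same idea. It is not a line-by-line translation: it only hoists the three pointer loads above the payload processing and does not software-pipeline the next group the way the assembly does. The struct layout, the function names, summing as the stand-in for "do something with payload", and a sentinel whose next pointer points back to itself are all my assumptions for illustration.

```c
#include <stddef.h>

/* Hypothetical node layout: payload first, then the next pointer. */
struct node {
    long data;
    struct node *next;
};

/* Assumption: the sentinel links to itself, so chasing ->next through it
 * is always a valid load and always yields the sentinel again. */
struct node sentinel = { 0, &sentinel };

/* Plain one-node-per-iteration walk; summing stands in for the real work. */
long sum_simple(const struct node *curr)
{
    long sum = 0;
    while (curr != &sentinel) {
        sum += curr->data;
        curr = curr->next;
    }
    return sum;
}

/* Walk unrolled by three: chase three next pointers up front, then test
 * for the sentinel only once per group. */
long sum_unrolled(const struct node *curr)
{
    long sum = 0;
    for (;;) {
        /* Speculative loads: any of these may already be the sentinel. */
        const struct node *n1 = curr->next;
        const struct node *n2 = n1->next;
        const struct node *n3 = n2->next;
        /* If n3 is the sentinel, at most three real nodes are left;
         * let the simple loop finish them. */
        if (n3 == &sentinel)
            break;
        /* Otherwise curr, n1 and n2 are all real nodes. */
        sum += curr->data + n1->data + n2->data;
        curr = n3;
    }
    return sum + sum_simple(curr);
}
```

The single comparison per group is only safe because of the self-linked sentinel: with nullptr as the terminator, each of the three pointer loads would need its own check before the dereference.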
As you can see in the unrolled loop, "ld t3, 8(t1)" comes almost right after "ld t1, 8(a6)", with only the load from 0(t1) in between, so a prefetch won't noticeably help here, and if the address that ends up in t3 is not in the cache, then "ld t5, 8(t3)" will stall no matter what. And moving the speculative loads up in the loop body, before the payloads are processed (using even more registers, as you can see), somewhat hurts the latency of processing the first three payloads.

Oh, and if you want to see something really crazy, look at e.g. splitting the branch instruction into separate prediction and resolution instructions [0].
[0] https://zilles.cs.illinois.edu/papers/branch_vanguard_isca_2...