Because if you issue a bogus prefetch, you can't cancel it, can you? So that's 90 or so cycles during which the fetch of your actual data is being delayed. Pointer chasing already strains memory bandwidth; requesting even more data from memory will only make things worse.
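For concreteness, here is roughly what that kind of speculative prefetch looks like in C (a sketch: the node layout and the summing "payload processing" are made up; `__builtin_prefetch` is the GCC/Clang hint):

```c
#include <stddef.h>

struct node {
    long data;
    struct node *next;
};

long sum; /* stand-in for real payload processing */

void traverse_with_prefetch(struct node *curr) {
    while (curr != NULL) {
        /* Hint the cache line of the next node so it (hopefully) arrives
           while we process the current payload. At the tail of the list
           this issues a bogus prefetch of address NULL -- architecturally
           harmless, but once the request is in flight it cannot be
           cancelled and it competes with the demand loads for bandwidth. */
        __builtin_prefetch(curr->next);
        sum += curr->data;   /* do something with the payload */
        curr = curr->next;   /* the pointer-chasing demand load */
    }
}
```

With only one node of lookahead the hint rarely hides much latency, which is part of the argument above.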
And unrolling the loop that traverses a linked list can be done, if you use a sentinel node instead of nullptr to signal the end:
beqz a0, .end
.loop:
ld a1, 0(a0) # a1 = curr->data
ld a0, 8(a0) # curr = curr->next
# do something with payload in a1 here
bnez a0, .loop
.end:
becomes
la s1, sentinel
beq a0, s1, .end
ld a1, 0(a0)
ld a2, 8(a0)
ld a3, 0(a2)
ld a4, 8(a2)
ld a5, 0(a4)
ld a6, 8(a4)
beq a6, s1, .trail
.loop:
ld t0, 0(a6)
ld t1, 8(a6)
ld t2, 0(t1)
ld t3, 8(t1)
ld t4, 0(t3)
ld t5, 8(t3)
# do something with three payloads in a1, a3, a5 here
mv a1, t0
mv a2, t1
mv a3, t2
mv a4, t3
mv a5, t4
mv a6, t5
bne t5, s1, .loop
# do something with the leftover payload in a1 here
mv a0, a2
beq a2, s1, .end
.trail:
ld a1, 0(a0)
ld a0, 8(a0)
# do something with payload in a1 here
bne a0, s1, .trail
.end:
As you can see, "ld t3, 8(t1)" comes almost immediately after "ld t1, 8(a6)", with only the load from 0(t1) in between, so a prefetch won't noticeably help here; and if the address that ends up in t3 is not in the cache, then "ld t5, 8(t3)" will stall no matter what. And moving the speculative loads up in the loop body, ahead of the payload processing (which costs even more registers, as you can see), somewhat hurts the latency of processing the first three payloads.
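For comparison, the same sentinel trick expressed in C (a sketch with a made-up summing "payload processing"; the key property is that the sentinel's next pointer links back to the sentinel itself, so the unrolled loads can safely run past the last real node):

```c
#include <stddef.h>

struct node {
    long data;
    struct node *next;
};

/* The sentinel links to itself, so chasing ->next past the end of the
   list is always a valid (if useless) load -- no nullptr check is needed
   before each speculative dereference. */
struct node sentinel = { 0, &sentinel };

long sum; /* stand-in for real payload processing */

void traverse_unrolled(struct node *curr) {
    while (curr != &sentinel) {
        /* Issue the pointer-chasing loads for up to three nodes before
           processing any payloads; safe even when fewer than three real
           nodes remain, because every chain parks on the sentinel. */
        struct node *n1 = curr->next;
        struct node *n2 = n1->next;
        sum += curr->data;
        if (n1 == &sentinel) return;
        sum += n1->data;
        if (n2 == &sentinel) return;
        sum += n2->data;
        curr = n2->next;
    }
}
```

A compiler or an out-of-order core can overlap the three loads at the top of the loop body, which is the whole point of the unrolled assembly above.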
Oh, and if you want to see something really crazy, look at, e.g., splitting the branch instruction into separate prediction and resolution instructions [0].
[0] https://zilles.cs.illinois.edu/papers/branch_vanguard_isca_2...