> You'd probably want to have a new __riscv_vsetvlmax_e8m8 at the start of each loop iteration, as otherwise an earlier iteration could cut off the vl (e.g. page unloaded by the OS), and thus the loop continues with the truncated vl.
Oh, yeah, that was a big oversight. Unfortunately, fixing it didn't undo the performance regression.
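
A minimal sketch of that fix, assuming a strlen-style loop built on the fault-only-first load. The function name `strlen_rvv_sketch` and the scalar fallback (only there so it compiles on non-RVV targets) are made up for illustration, and this is not necessarily how the loop in question is structured:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>

/* Hypothetical strlen-style loop using the fault-only-first load. */
size_t strlen_rvv_sketch(const char *s) {
    const uint8_t *p = (const uint8_t *)s;
    for (;;) {
        /* Request the maximum vl again on every iteration, so a vl that
           was truncated by an earlier load does not carry over. */
        size_t vl = __riscv_vsetvlmax_e8m8();
        /* vle8ff.v: only element 0 can trap; a fault on a later element
           shrinks vl instead, and the new value is written back to vl. */
        vuint8m8_t v = __riscv_vle8ff_v_u8m8(p, &vl, vl);
        assert(vl > 0); /* at least one element is always loaded */
        vbool1_t is_nul = __riscv_vmseq_vx_u8m8_b1(v, 0, vl);
        long idx = __riscv_vfirst_m_b1(is_nul, vl);
        if (idx >= 0)
            return (size_t)(p - (const uint8_t *)s) + (size_t)idx;
        p += vl; /* advance by the number of bytes actually loaded */
    }
}
#else
/* Scalar stand-in so the sketch still builds without the V extension. */
size_t strlen_rvv_sketch(const char *s) {
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
#endif
```

The important part is that `vl` is a fresh local set from `__riscv_vsetvlmax_e8m8()` inside the loop, rather than a value computed once before it.
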
> The normal load should just segfault if any loaded byte is outside of readable memory, same as with a scalar load which is similarly partly outside.
I don't quite understand how that plays out.
The reference memcpy implementation uses `vle8.v` and the reference strlen implementation uses `vle8ff.v`.
I think I understand how it works in strlen, but why does memcpy work without the ff? Does it just skip the instruction, or repeat it? Because in either case, shouldn't `vle8.v` work with strlen as well? There must be another option, but I can't think of any.
Also, does this mean I can get the original performance back if I make sure to page-align my pointers and use `vle8.v`?