> OOO and even wider RVV registers will then automatically speed things up, without even a recompile.
The problem is that for some RVV features it's unclear how they will perform on high-performance OoO cores:
* General choice of LMUL: on in-order cores, maximizing LMUL without spilling is clearly the best approach; for OoO cores this isn't clear.
* How will LMUL>1 vrgather and vcompress perform?
* How large is the impact of vsetvli instructions? Is it worth hoisting them out of loops whenever possible, or is the impact minimal, as in current in-order implementations?
* What is the overhead of the .vx instruction variants? Is there additional cost in moving data between GPRs and vector registers?
* Is there additional overhead when reinterpreting vector masks?
* What performance can we expect from the more complex loads/stores, especially the segmented ones?
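To make the vsetvli question concrete, here's a rough sketch (a hypothetical byte-copy loop, not from any real codebase) of the two styles. When the AVL shrinks every iteration, vsetvli has to stay in the loop; when you process whole VLMAX-sized blocks, it can be hoisted:

```asm
# vsetvli inside the loop: needed because the AVL (a2) shrinks each pass
copy_loop:
    vsetvli t0, a2, e8, m8, ta, ma   # t0 = elements handled this pass
    vle8.v  v0, (a0)
    vse8.v  v0, (a1)
    add     a0, a0, t0
    add     a1, a1, t0
    sub     a2, a2, t0
    bnez    a2, copy_loop

# vsetvli hoisted: set vl = VLMAX once, loop over full blocks
# (tail elements would be handled separately; a3 = block count, hypothetical)
    vsetvli t0, x0, e8, m8, ta, ma   # x0 as AVL with rd != x0 -> vl = VLMAX
block_loop:
    vle8.v  v0, (a0)
    vse8.v  v0, (a1)
    add     a0, a0, t0
    add     a1, a1, t0
    addi    a3, a3, -1
    bnez    a3, block_loop
```

On the in-order cores measured so far the first form costs little; the open question is whether OoO implementations will make the hoisted form meaningfully faster.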
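Similarly, the .vx question boils down to which of these two (purely illustrative) alternatives is cheaper inside a hot loop:

```asm
# .vx form: the scalar operand is read from GPR a3 on every use
    vadd.vx v0, v0, a3

# splat once into a vector register, then use the .vv form;
# this burns a vector register but may avoid repeated GPR-to-vector traffic
    vmv.v.x v8, a3
    vadd.vv v0, v0, v8
```

Whether the splat-plus-.vv variant ever wins depends on how a given core implements the GPR-to-vector-unit path, which is exactly the kind of detail the scheduling models below hint at.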
The LLVM scheduling models give some insight:
* SiFive P670: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...
* Tenstorrent Ascalon: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ... (still missing the vector part, but a PR is supposed to land in the near future)
I'm trying to collect as much info on hardware as I can: https://camel-cdr.github.io/rvv-bench-results/index.html