> OOO and even wider RVV registers will then automatically speed things up, without even a recompile.
The problem is that for some RVV features it's unclear how they will perform on high-performance OoO cores:
* General choice of LMUL: on in-order cores, maximizing LMUL without spilling is clearly the best approach; for OoO cores this isn't clear.
* How will LMUL>1 vrgather and vcompress perform?
* How large is the impact of vsetvli instructions? Is it worth hoisting them out of loops whenever possible, or is the impact minimal, as in current in-order implementations?
* What is the overhead of the .vx instruction variants? Is there additional cost in moving data between GPRs and vector registers?
* Is there additional overhead when reinterpreting vector masks?
* What performance can we expect from the more complex loads/stores, especially the segmented ones?
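To make the vsetvli question concrete, here's a rough sketch (a hypothetical byte-copy loop, not from any real codebase) of the two styles. When the AVL shrinks every iteration, vsetvli has to stay in the loop; when you process whole VLMAX-sized blocks, it can be hoisted:

```asm
# vsetvli inside the loop: needed because the AVL (a2) shrinks each pass
copy_loop:
    vsetvli t0, a2, e8, m8, ta, ma   # t0 = elements handled this pass
    vle8.v  v0, (a0)
    vse8.v  v0, (a1)
    add     a0, a0, t0
    add     a1, a1, t0
    sub     a2, a2, t0
    bnez    a2, copy_loop

# vsetvli hoisted: set vl = VLMAX once, loop over full blocks
# (tail elements would be handled separately; a3 = block count, hypothetical)
    vsetvli t0, x0, e8, m8, ta, ma   # x0 as AVL with rd != x0 -> vl = VLMAX
block_loop:
    vle8.v  v0, (a0)
    vse8.v  v0, (a1)
    add     a0, a0, t0
    add     a1, a1, t0
    addi    a3, a3, -1
    bnez    a3, block_loop
```

On the in-order cores measured so far the first form costs little; the open question is whether OoO implementations will make the hoisted form meaningfully faster.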
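Similarly, the .vx question boils down to which of these two (purely illustrative) alternatives is cheaper inside a hot loop:

```asm
# .vx form: the scalar operand is read from GPR a3 on every use
    vadd.vx v0, v0, a3

# splat once into a vector register, then use the .vv form;
# this burns a vector register but may avoid repeated GPR-to-vector traffic
    vmv.v.x v8, a3
    vadd.vv v0, v0, v8
```

Whether the splat-plus-.vv variant ever wins depends on how a given core implements the GPR-to-vector-unit path, which is exactly the kind of detail the scheduling models below hint at.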
The LLVM scheduling models give some insight:
* SiFive P670: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...
* Tenstorrent Ascalon: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ... (still missing the vector part, but a PR is supposed to land in the near future)
I'm trying to collect as much info on hardware as I can: https://camel-cdr.github.io/rvv-bench-results/index.html