On AMD hardware, I don't understand why people avoid AMD's support, which is just a version of BLIS and libflame.
A year ago, I benchmarked a transformer network with libtorch linked against various BLAS libraries (numbers are in sentences per second; 4 threads; MKL run with the CPU detection override on AMD):
Ryzen 3700X - OpenBLAS: 83, BLIS: 69, AMD BLIS: 80, MKL: 119
Xeon Gold 6138 - OpenBLAS: 88, BLIS: 52, AMD BLIS: 59, MKL: 128
I guess people avoid AMD's support because MKL is just much faster? AMD BLIS has added batch GEMM support since then, but I haven't had time to try it out yet.
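To put the numbers above in relative terms, here is a quick sketch that divides the MKL throughput by each other library's throughput on the same CPU. The figures are copied from the benchmark lines above; the helper name `mkl_speedups` is just something I made up for illustration.

```python
# Throughput in sentences per second, copied from the benchmark above.
results = {
    "Ryzen 3700X": {"OpenBLAS": 83, "BLIS": 69, "AMD BLIS": 80, "MKL": 119},
    "Xeon Gold 6138": {"OpenBLAS": 88, "BLIS": 52, "AMD BLIS": 59, "MKL": 128},
}


def mkl_speedups(throughput):
    """Return MKL throughput divided by each other library's throughput."""
    mkl = throughput["MKL"]
    return {lib: mkl / value for lib, value in throughput.items() if lib != "MKL"}


# Hypothetical helper applied per CPU: {cpu: {library: MKL / library}}.
speedups = {cpu: mkl_speedups(t) for cpu, t in results.items()}

for cpu, ratios in speedups.items():
    for lib, ratio in ratios.items():
        print(f"{cpu}: MKL is {ratio:.2f}x faster than {lib}")
```

By this measure MKL is about 1.49x faster than AMD BLIS on the Ryzen 3700X and about 2.17x faster on the Xeon Gold 6138, for this one workload at 4 threads.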