undefined | Better HN

0 pointsryao1y ago0 comments

The purpose of L1 cache is to avoid long round trips to memory. What you describe is L1 cache doing what it is intended to do. Unfortunately, I do not have your code, so it is not clear to me that it benefits from doing 2 AVX-512 loads per cycle.

I am also not sure what CPU this is. On recent AMD processors at the very least, it should be impossible to get FMA throughput that is 50 times higher from L1 cache bandwidth than system memory bandwidth. On the Ryzen 7 9800X3D for example, a single core is limited to about 64GB/sec. 50 times more would be 3.2TB/sec, which is ~5 times faster than possible to load from L1 cache even with 2 AVX-512 loads per cycle.

I wonder if you are describing some sort of GEMM routine, which is a place where 50 times more FMA throughput is possible if you do things in a clever way. GEMM is somewhat weird, since without copying to force things into L1 cache, it does not run at full speed, and memory bandwidth from RAM is always below peak memory bandwidth, even without the memcpy() trick to force things into L1 cache. That excludes the case where you stuff GEMV in GEMM, where it does become memory bandwidth bound.

0 comments

janwas1y ago

The code is unfortunately not (yet) open source. The CPU with 50x is an SKX Gold, and it is similar for Zen4. I compute this ratio as #FMA * 4 / total system memory bandwidth. We are indeed not fully memBW bound :)

menaerus1y ago

I'd be curious if you measured 50x on a single core implementation or is the algorithm distributed to multiple cores?

I ask because you say that the results are similar to Zen4 so this would sorta imply that you run and measure single-core implementation? Intel in multi-core load-store looses a lot of bandwidth when compared to Zen3/4/5 since there's a lot of contention going on due to Intel cache architecture.

j / k navigate · click thread line to collapse

0 comments

janwas1y ago

menaerus1y ago

I'd be curious if you measured 50x on a single core implementation or is the algorithm distributed to multiple cores?

j / k navigate · click thread line to collapse