I am also not sure what CPU this is. On recent AMD processors, at the very least, it should be impossible to get FMA throughput that is 50 times higher when fed from L1 cache than when fed from system memory. On the Ryzen 7 9800X3D, for example, a single core is limited to about 64 GB/s of memory bandwidth. 50 times that would be 3.2 TB/s, which is ~5 times more than can be loaded from L1 cache, even with 2 AVX-512 loads per cycle.
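To make that arithmetic easy to check, here is a rough back-of-the-envelope sketch. The 64 GB/s figure and the 2 AVX-512 loads per cycle come from the numbers above; the 5.2 GHz clock is my assumption (roughly the advertised boost clock), and the function name is just for illustration.

```python
# Back-of-the-envelope check; not a measurement.
DRAM_GBPS_PER_CORE = 64.0   # single-core memory bandwidth limit quoted above
CLAIMED_SPEEDUP = 50        # the "50 times" figure under discussion
CLOCK_GHZ = 5.2             # ASSUMPTION: approximate boost clock
LOADS_PER_CYCLE = 2         # AVX-512 loads per cycle, as stated above
BYTES_PER_LOAD = 64         # one 512-bit register is 64 bytes


def l1_load_bandwidth_gbps(clock_ghz, loads_per_cycle, bytes_per_load):
    """Peak L1 load bandwidth in GB/s: cycles/ns * loads/cycle * bytes/load."""
    return clock_ghz * loads_per_cycle * bytes_per_load


# Bandwidth needed to be 50x faster than the DRAM-fed case.
required_gbps = DRAM_GBPS_PER_CORE * CLAIMED_SPEEDUP

# Upper bound on what L1 can deliver under the assumptions above.
l1_peak_gbps = l1_load_bandwidth_gbps(CLOCK_GHZ, LOADS_PER_CYCLE, BYTES_PER_LOAD)

# How far beyond the L1 peak the claim would be.
shortfall = required_gbps / l1_peak_gbps

print(f"required: {required_gbps:.0f} GB/s")
print(f"L1 peak:  {l1_peak_gbps:.1f} GB/s")
print(f"ratio:    {shortfall:.1f}x")
```

A different clock speed moves the ratio a bit, but not enough to close a gap of that size.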
I wonder if you are describing some sort of GEMM routine, which is a place where 50 times more FMA throughput is possible if you do things in a clever way. GEMM is somewhat weird: without copying to force things into L1 cache, it does not run at full speed, and the bandwidth it pulls from RAM is always below peak memory bandwidth, even without the memcpy() trick. The exception is when you stuff a GEMV into GEMM, where it does become memory bandwidth bound.
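A quick way to see why GEMM can get so far ahead of its memory traffic while GEMV cannot is to compare FLOPs per byte. This is only a sketch under idealized assumptions: square n x n operands, FP32 elements, and every matrix or vector moved to or from RAM exactly once (perfect cache reuse). The function names are made up for illustration.

```python
# Idealized arithmetic intensity (FLOPs per byte of RAM traffic).
# ASSUMPTIONS: square n x n operands, 4-byte elements, each operand
# crosses the memory bus exactly once.

def gemm_flops_per_byte(n, elem_bytes=4):
    """C = A @ B with n x n matrices: 2*n^3 FLOPs over 3*n^2 elements."""
    flops = 2 * n**3                       # one multiply + one add per inner step
    bytes_moved = 3 * n * n * elem_bytes   # read A, read B, write C
    return flops / bytes_moved


def gemv_flops_per_byte(n, elem_bytes=4):
    """y = A @ x with an n x n matrix: 2*n^2 FLOPs over n^2 + 2n elements."""
    flops = 2 * n * n
    bytes_moved = (n * n + 2 * n) * elem_bytes  # read A and x, write y
    return flops / bytes_moved


n = 1024
gemm_intensity = gemm_flops_per_byte(n)
gemv_intensity = gemv_flops_per_byte(n)

print(f"GEMM: {gemm_intensity:.1f} FLOPs/byte")
print(f"GEMV: {gemv_intensity:.3f} FLOPs/byte")
```

GEMM's intensity grows with n, so it has room to be compute bound. GEMV stays below 0.5 FLOPs/byte in FP32 no matter how large n gets, because every matrix element is used exactly once.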