The limit is the number of outstanding cache line requests a core can have in flight to the memory controller. CPUs have a fixed number of slots for this, usually around 10-12 per core. Intel calls them LFBs (Line Fill Buffers); the generic term, also used for AMD parts, is MSHRs (Miss Status Holding Registers). When the slots are full, the core can issue no more requests and has to wait for some to complete. Since each request takes a full memory latency to come back, per-core bandwidth tops out at roughly (slots × line size) / latency. Apple M chips (probably) have more slots, and the memory is physically packaged together with the CPU, so they get better numbers.
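To put rough numbers on that, here is a back-of-the-envelope sketch using Little's law (throughput = requests in flight / latency). The 64-byte line size and the 80 ns latency are assumptions picked for illustration, not measurements of any particular chip, and `per_core_bandwidth_gbs` is just a made-up helper name.

```python
def per_core_bandwidth_gbs(outstanding_lines, line_bytes=64, latency_ns=80.0):
    """Upper bound on one core's memory bandwidth via Little's law.

    outstanding_lines: fill-buffer slots (assumed, e.g. 10-12)
    line_bytes:        cache line size (assumed 64 bytes)
    latency_ns:        memory latency per request (assumed 80 ns)
    """
    bytes_in_flight = outstanding_lines * line_bytes
    # bytes per nanosecond is numerically equal to GB/s (1e9 bytes/s)
    return bytes_in_flight / latency_ns


if __name__ == "__main__":
    for slots in (10, 12, 24):
        print(f"{slots} slots: {per_core_bandwidth_gbs(slots):.1f} GB/s")
```

Under these assumptions 10 slots give about 8 GB/s and 12 give about 9.6 GB/s, so doubling the slots or halving the latency doubles the ceiling. Hardware prefetchers and other buffers change the real numbers, so treat this only as a ceiling estimate.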
CPU caches are typically SRAM (static RAM). An SRAM cell is more complicated (several transistors per bit) but requires no refreshing. DRAM is basically just a capacitor per bit, and capacitors leak, so you constantly have to refresh the entire memory space. When the CPU sends a request to RAM, the memory may be busy refreshing the soon-to-decay parts and unable to respond right away. And if I recall correctly, reading from DRAM destroys what was there, so the row has to be read out, written back, and the data sent to the CPU, which is just a lot of steps. But the price and die size difference is huge, so we use GB or TB levels of DRAM and MB levels of SRAM.
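For a feel of how much time refresh takes, here is a tiny sketch. The two timings are assumed ballpark values of the kind found in DDR4 datasheets (a refresh command every 7.8 µs, each one blocking for about 350 ns); real parts vary with density and temperature, and `refresh_overhead` is a hypothetical helper name.

```python
def refresh_overhead(t_rfc_ns=350.0, t_refi_ns=7800.0):
    """Fraction of time spent refreshing instead of serving requests.

    t_rfc_ns:  time one refresh command blocks the rank (assumed 350 ns)
    t_refi_ns: average interval between refresh commands (assumed 7.8 us)
    """
    return t_rfc_ns / t_refi_ns


if __name__ == "__main__":
    print(f"refresh overhead: {refresh_overhead():.1%}")
```

With these assumed timings the memory is unavailable only a few percent of the time, so refresh mostly shows up as an occasional latency bump rather than a large bandwidth loss.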
Wouldn't this bound the overall memory bandwidth, not the per-core bandwidth? I've sort of assumed that just providing more line fill buffers wouldn't be sufficient, and that the number of LFBs is chosen in tandem with a number of other things, but I'm not sure what those other things are (that is, just increasing the number of LFBs might not be meaningful without also increasing XYZ).
Wouldn't this be the limiting factor more so for overall throughput than for per-core throughput? I believe with Zen 4, for instance, it all goes through a central memory controller.