Better HN

0 points · eliasdejong · 2mo ago · 0 comments

The limit is the number of outstanding cache line requests to the memory controller. CPUs have a fixed number of slots for this, usually around 10-12. Intel calls them LFBs (Line Fill Buffers) and AMD calls them MSHRs (Miss Status Holding Registers). When all the slots are full, the CPU cannot issue any more requests and has to wait for one to complete. Apple M chips (probably) have more slots, and the memory is physically packaged together with the CPU, so they get better numbers.
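
To put rough numbers on that limit, here is a back-of-the-envelope sketch using Little's law: bytes in flight divided by latency gives a ceiling on throughput. The 64-byte line size is standard on x86, but the 80 ns latency is an assumed round figure rather than a measurement, the function name is made up for illustration, and the slots are treated as belonging to one core. It also ignores hardware prefetching, so treat the result as an order-of-magnitude estimate only.

```python
def per_core_bandwidth_gbs(slots: int, line_bytes: int = 64,
                           latency_ns: float = 80.0) -> float:
    """Rough ceiling on one core's memory bandwidth via Little's law.

    slots      -- outstanding cache line requests (LFBs / MSHRs)
    line_bytes -- cache line size; 64 bytes on x86
    latency_ns -- ASSUMED round-trip memory latency, not a measured value
    """
    if slots <= 0 or line_bytes <= 0 or latency_ns <= 0:
        raise ValueError("all parameters must be positive")
    # Bytes in flight divided by latency. Bytes per nanosecond is
    # numerically the same as GB/s (with 1 GB = 1e9 bytes).
    return slots * line_bytes / latency_ns


if __name__ == "__main__":
    # 10 and 12 are the slot counts mentioned above; 24 is a hypothetical
    # larger count, included only to show how the ceiling scales.
    for slots in (10, 12, 24):
        print(f"{slots} slots: ~{per_core_bandwidth_gbs(slots):.1f} GB/s")
```

With those assumed numbers, 10 slots works out to 8 GB/s and 12 slots to 9.6 GB/s. The ceiling scales linearly with the slot count and inversely with latency, which is why both more slots and closer memory would help.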

foota · 2mo ago

I assume these must be really expensive? Otherwise it seems like a great way to improve throughput on low concurrency tasks.

IgorPartola · 2mo ago

At least in older CPUs the caches were SRAM (static RAM). It is more complicated per bit but requires no refreshing. DRAM is basically just a capacitor per bit, and capacitors leak, so you constantly have to refresh the entire memory space. When the CPU sends a request to RAM, the memory controller might be too busy refreshing the soon-to-decay parts to respond right away. And if I recall correctly, reading from DRAM destroys what was there, so the process is to read it, then write it back, then send the answer to the CPU, which is just a lot of steps. But the price and die size difference is huge, so we use GB or TB levels of DRAM and MB levels of SRAM.

foota · 2mo ago

Wouldn't this bound the overall memory bandwidth, not the per-core bandwidth? I've sort of assumed that just providing more line fill buffers wouldn't be sufficient, and that the number of LFBs is chosen in tandem with a number of other things, but I'm not sure what the other things are (that is, just increasing the number of LFBs might not be meaningful without also increasing XYZ).

memoriuaysj · 2mo ago

bus wires. you can route only so many of them on a motherboard.

it's why GPUs have their memory chips in a circle around the GPU chip.

foota · 2mo ago

Wouldn't this be the limiting factor more so for overall throughput, not per core? I believe with Zen 4, for instance, it goes through a central memory controller.
