GPUs have memory hierarchies too. A 4090 has about 16MB of L1 cache (128KB per SM) and 72MB of L2 cache, followed by 24GB of GDDR6X, followed by host RAM that can be accessed via PCIe.
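You can query most of these numbers from the runtime yourself. A minimal sketch (device 0 assumed; note the runtime doesn't report a total L1 figure, since L1 is carved out of each SM's shared-memory block):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    printf("SMs:            %d\n", prop.multiProcessorCount);
    printf("L2 cache:       %d MB\n", prop.l2CacheSize >> 20);
    printf("Global memory:  %zu GB\n", prop.totalGlobalMem >> 30);
    // L1 is per-SM and shared with shared memory, so only the
    // shared-memory carve-out is reported directly:
    printf("Shared mem/SM:  %zu KB\n", prop.sharedMemPerMultiprocessor >> 10);
    return 0;
}
```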
The issue is that GPUs are massively parallel. A 4090 has 128 streaming multiprocessors, each executing 128 "threads" or "lanes" in parallel, so roughly 16,384 threads in flight. If each "thread" works on a different part of memory, that leaves you with 1kB of L1 cache per thread and about 4.5kB of L2 each. Every clock cycle you might be issuing thousands of requests to your memory controller for cache misses and prefetching. That's why you want insanely fast RAM.
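This is also why access patterns matter so much more on a GPU than a CPU. A sketch of the classic contrast (kernel names are mine, just for illustration):

```cuda
#include <cuda_runtime.h>

// Coalesced: adjacent lanes read adjacent floats, so one warp's 32 loads
// land in a single 128-byte cache line -- one memory transaction.
__global__ void coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Strided: each lane jumps `stride` floats, so those same 32 loads touch
// 32 different cache lines -- the memory controller moves up to 32x the
// traffic for the same useful data, and the tiny per-thread share of
// cache evaporates.
__global__ void strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i] * 2.0f;
}
```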
You can write CUDA code that directly accesses host memory as a layer beyond that, but usually you want to transfer data in bigger chunks. You could probably make a card that adds DDR4 slots as an additional level of the hierarchy. It's the kind of weird stuff Intel might do (the Xeon Phi had some interesting memory layout ideas).
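Both options exist in the runtime API today. A minimal sketch of zero-copy mapped host memory versus the usual bulk transfer (the `scale` kernel and sizes are placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // with zero-copy, every access crosses PCIe
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Zero-copy: pinned host memory mapped into the device address space.
    // The kernel dereferences it directly over PCIe -- fine for a few
    // scattered accesses, terrible for anything bandwidth-bound.
    float* host;
    cudaHostAlloc(&host, bytes, cudaHostAllocMapped);
    float* dev_view;
    cudaHostGetDevicePointer(&dev_view, host, 0);
    scale<<<(n + 255) / 256, 256>>>(dev_view, n);
    cudaDeviceSynchronize();

    // Bulk transfer: stage the data into device memory first, then run at
    // full GDDR6X bandwidth. This is the "bigger chunks" approach.
    float* dev;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```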