GPUs have memory hierarchies too. A 4090 has about 16MB of L1 cache (128KB per SM) and 72MB of L2 cache, followed by 24GB of GDDR6X, followed by host RAM that can be accessed via PCIe.
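You can query most of these numbers from the runtime yourself. A minimal sketch (device 0 assumed; note the runtime doesn't report a total L1 figure, since L1 is carved out of each SM's shared-memory block):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    printf("SMs:            %d\n", prop.multiProcessorCount);
    printf("L2 cache:       %d MB\n", prop.l2CacheSize >> 20);
    printf("Global memory:  %zu GB\n", prop.totalGlobalMem >> 30);
    // L1 is per-SM and shared with shared memory, so only the
    // shared-memory carve-out is reported directly:
    printf("Shared mem/SM:  %zu KB\n", prop.sharedMemPerMultiprocessor >> 10);
    return 0;
}
```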
The issue is that GPUs are massively parallel. A 4090 has 128 streaming multiprocessors, each executing 128 "threads" or "lanes" in parallel, so roughly 16,384 threads in flight. If each "thread" works on a different part of memory, that leaves you with 1kB of L1 cache per thread and about 4.5kB of L2 each. Every clock cycle you might be issuing thousands of requests to your memory controller for cache misses and prefetching. That's why you want insanely fast RAM.
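This is also why access patterns matter so much more on a GPU than a CPU. A sketch of the classic contrast (kernel names are mine, just for illustration):

```cuda
#include <cuda_runtime.h>

// Coalesced: adjacent lanes read adjacent floats, so one warp's 32 loads
// land in a single 128-byte cache line -- one memory transaction.
__global__ void coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Strided: each lane jumps `stride` floats, so those same 32 loads touch
// 32 different cache lines -- the memory controller moves up to 32x the
// traffic for the same useful data, and the tiny per-thread share of
// cache evaporates.
__global__ void strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i] * 2.0f;
}
```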
You can write CUDA code that directly accesses host memory as a layer beyond that, but usually you want to transfer data in bigger chunks. You could probably make a card that adds DDR4 slots as an additional level of the hierarchy. It's the kind of weird stuff Intel might do (the Xeon Phi had some interesting memory layout ideas).
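Both options exist in the runtime API today. A minimal sketch of zero-copy mapped host memory versus the usual bulk transfer (the `scale` kernel and sizes are placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // with zero-copy, every access crosses PCIe
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Zero-copy: pinned host memory mapped into the device address space.
    // The kernel dereferences it directly over PCIe -- fine for a few
    // scattered accesses, terrible for anything bandwidth-bound.
    float* host;
    cudaHostAlloc(&host, bytes, cudaHostAllocMapped);
    float* dev_view;
    cudaHostGetDevicePointer(&dev_view, host, 0);
    scale<<<(n + 255) / 256, 256>>>(dev_view, n);
    cudaDeviceSynchronize();

    // Bulk transfer: stage the data into device memory first, then run at
    // full GDDR6X bandwidth. This is the "bigger chunks" approach.
    float* dev;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```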