This is the result of that question itself and some weekend vibecoding (it has the linked library repository in the readme as well), it seems to work, even on consumer gpus, it should work better on professional ones tho
I wonder... what if the m.2 storage was actually DRAM? You probably don't need persistence for spilling a model off the GPU. How would it fare vs just adding more host memory? The m.2 ram would be less flexible, but would keep the system ram free for the CPU.
I gave a talk a few years ago at dask summit (conf?) on making the stars align with dask-cudf here. We were helping a customer accelerate log analytics by proving out our stack for nodes that look roughly like: parallel ssd storage arrays (30 x 3 GB/s?) -> GPUDirect Storage -> 4 x 30 GB/s PCIe (?) -> 8 x A100 GPUs, something like that. It'd be cool to see the same thing now in the LLM world, such as a multi-GPU MoE, or even a single-GPU one for that matter!
https://www.servethehome.com/hyper-scalers-are-using-cxl-to-...
Do you have $2/hr to rent an RTX 6000 96GB or $5/hr for B200 180GB on the cloud?
But 5 seconds / token is quite slow yeah. I guess this is for low ram machines? I'm pretty sure my 5950x with 128 gb ram can run this faster on the CPU with some layers / prefill on the 3060 gpu I have.
I also see that they claim the process is compute bound at 2 seconds/token, but that doesn't seem correct with a 3090?
DDR4 tops out about 27Gbs
DDR5 can do around 40Gbs
So for 70B model at 8 bit quant, you will get around 0.3-0.5 tokens per second using RAM alone.
One workup indicated it was theoretically possible to modify a piece of SGLang's routing layer to support JIT predict-ahead expert swaps from Gen5 NVMe storage straight into GPU memory.
I'm hoping that proves true. The setup relies on NVIDIA Dynamo, so NIXL primitives are available to support that.
Curious if anyone's tried this already.
And that's good because that increases democratization of AI away from the silos that are being created.
I know you said you're involved in some retrogaming and were experimenting, but as someone who works in a world where hardware is pretty heavily abstracted away, even if I got into retrogaming I don't know that I'd consider that there may be a systems improvement lying around. Beyond the creative aspect, it feels like there is some systems and hardware background that helped put the idea together (and I'd be interested to go learn about of that systems/hardware knowledge myself).
The idea was basically to run a llm on a ps2, then I ran into some problems as the 32mb ram cap with 4mb vram cap; so I had to figure out a way to stream layers on the forward pass. Given that ps2 manages to give instructions directly to the vram that's capable of 32bit addresses, it gave an insane amount of tok/s, then I wondered if I could do the same on my puter
Perhaps that's what made them think to try.
Perhaps the current batch of smart memory cards which on the PS2 I believe have quite complex DMA capabilities to stream from the SD card game data.
I’ve also wondered why the routers aren’t training to be serially consistent so you can predict layers to swap into VRAM a few layers ahead to maximize available bandwidth.
Unless you're handing that in some kind of fancy way, you'll be holding up the batch while waiting for host memory which will kill your throughout.
It makes much more sense for non batched local inference, especially if you can keep the MoE routing stable like you say, but most folks aren't optimising for that.
1) This is basically the intention of several recent MoE models: keep particular generally useful experts hot in VRAM.
2) Unless you can swap layers in faster than you consume them there is no point to predicting layers (what does this even really mean? did you mean predicting experts?).
It seems at the moment the best you can do is keep experts and layers more likely to be used for a given query in VRAM and offload the rest, but this is work-dependent.
Nice work. PCI-P2P (GPU-Direct (tm)) is such great stuff. Cool to see!
- https://taalas.com/the-path-to-ubiquitous-ai/ - https://chatjimmy.ai/