Napkin math: 7x(4/8)x8 is 28GB, and q4 uses a little more than just 4 bits per param, and there’s extra overhead for context, and the FFN to select experts is probably more on top of that.
It would probably fit in 32GB at 4-bit but probably won’t run with sensible quantization/perf on a 3090/4090 without other tricks like offloading. Depending on how likely the same experts are to be chosen for multiple sequential tokens, offloading experts may be viable.