It would probably fit in 32GB at 4-bit, but it likely won’t run with sensible quantization/perf on a 3090/4090 without other tricks like offloading. Depending on how often the same experts are chosen for consecutive tokens, offloading experts may be viable.
It’s theoretically doable, with quantization from the recent 2-bit quant paper and a custom implementation (in exllamav2?)
EDIT: Actually the download is much smaller than 8x7B. Not sure how, but it’s sized more like a 30B, perfect for a 3090. Very interesting.
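For what it's worth, here's the back-of-envelope math I'm using for "fits on a 3090". It's just weights-only sizing (params × bits ÷ 8) and ignores KV cache, activations, and quantization overhead like scales/zero-points, so treat the numbers as lower bounds; the 30B figure is the rough size from the edit above, not an exact param count:

```python
# Rough weights-only VRAM estimate for a quantized model.
# Ignores KV cache, activations, and quant overhead (scales/zero-points),
# so real usage will be somewhat higher.
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in decimal GB."""
    return num_params * bits_per_weight / 8 / 1e9

# e.g. a ~30B-parameter model:
print(weight_size_gb(30e9, 4))  # 15.0 -> comfortable on a 24 GB 3090
print(weight_size_gb(30e9, 2))  # 7.5  -> easy fit at 2-bit
```

By the same math, a full 8x7B (~47B params with shared attention) at 4-bit lands around 23 GB of weights alone, which is why it's borderline on a single 24 GB card.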