You need to quantize the 70b to run it on that kind of hardware as well, since even float16 wouldn't fit. But 405b:IQ1_M seems to be smarter than 70b:Q4_K_M in my experiments (admittedly very limited, because it's so slow).
Note that IQ1_M quants are not really "1-bit" despite the name. They're somewhere around 1.8 bpw, which just happens to be enough to fit the model into 128 GB with some room left over for inference.
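Rough back-of-the-envelope math, counting weights only and ignoring KV cache / runtime overhead (the ~4.8 bpw figure for Q4_K_M is an approximation):

```python
# Weight-only size estimate: params * bits-per-weight / 8, in decimal GB.
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"70b  @ float16 (16 bpw):  {model_size_gb(70, 16):.0f} GB")   # ~140 GB, doesn't fit in 128 GB
print(f"70b  @ Q4_K_M (~4.8 bpw): {model_size_gb(70, 4.8):.0f} GB")  # ~42 GB
print(f"405b @ IQ1_M (~1.8 bpw):  {model_size_gb(405, 1.8):.0f} GB") # ~91 GB, leaves headroom
```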