You could just give the TL;DR: by far the biggest improvement across generations of nVidia chips is calculating faster at half the precision (half the bits). For Blackwell vs. Hopper it was "double performance", by which they mean Blackwell can calculate in NVFP4 at twice the rate Hopper can calculate in FP8. Then go back generation by generation, halving the bit width each time, until you arrive at FP64, where we started. They even made a slight detour to "FP128".
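To get a feel for what "half the bits" means at the bottom of that ladder, here is a small Python sketch that enumerates every non-negative value the 4-bit E2M1 element format and the 8-bit E4M3 format can represent. The bias and NaN conventions are my reading of the OCP FP8 and microscaling specs (E2M1: bias 1, no inf/NaN; E4M3: bias 7, all-ones code reserved for NaN), and the helper name `float_values` is made up for illustration.

```python
def float_values(exp_bits, man_bits, bias, nan_top_code=False):
    """Return the sorted non-negative finite values of a tiny float format."""
    values = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            # Assumption: like OCP E4M3, only the all-ones code is NaN (no infinities).
            if nan_top_code and e == 2 ** exp_bits - 1 and m == 2 ** man_bits - 1:
                continue
            if e == 0:
                v = (m / 2 ** man_bits) * 2.0 ** (1 - bias)  # subnormal
            else:
                v = (1 + m / 2 ** man_bits) * 2.0 ** (e - bias)  # normal
            values.add(v)
    return sorted(values)


FP4_E2M1 = float_values(2, 1, bias=1)
FP8_E4M3 = float_values(4, 3, bias=7, nan_top_code=True)

print(FP4_E2M1)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(len(FP8_E4M3), max(FP8_E4M3))  # 127 448.0
```

Eight magnitudes is all a 4-bit float gets, which is why nobody uses it without extra scaling.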
Decide for yourself whether this is a real improvement. Keep in mind, though, that nVidia did not just release the new chips; they also demonstrated training a neural net in NVFP4.
It's not the only improvement, but it is by far the biggest.
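As for the future: nobody has gotten FP2 to work satisfactorily yet. But hey, maybe at nVidia's next conference. Even NVFP4 is not really 4 bits, though: each block of 16 values shares an 8-bit (E4M3) scale factor, there is an additional per-tensor scale, and the products are accumulated at higher precision, so various parts of the computation don't actually happen at 4 bits. The same was true of FP8 (you could use it like that, but people didn't).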
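A minimal Python sketch of the block-scaling idea, assuming an NVFP4-like layout: 16 values share one scale chosen so the block's largest magnitude maps onto 6.0 (the largest E2M1 value), each value is rounded to the nearest E2M1 code, and the dot product is summed in full precision. All helper names are hypothetical. As simplifications, the scale stays a Python float instead of being rounded to E4M3, the per-tensor scale is left out, and ties round to the smaller magnitude, so this illustrates the idea rather than the actual hardware behaviour.

```python
import math

# Non-negative magnitudes of the 4-bit E2M1 element format.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # NVFP4-style block size (MXFP4 uses 32)


def round_to_e2m1(x):
    """Round x to the nearest signed E2M1 value, saturating at +/-6.

    Ties go to the smaller magnitude here (a simplification).
    """
    mag = min(E2M1, key=lambda v: abs(abs(x) - v))
    return math.copysign(mag, x)


def quantize_blocks(xs, block=BLOCK):
    """Return a list of (scale, codes): one shared scale per block of values."""
    blocks = []
    for i in range(0, len(xs), block):
        chunk = xs[i:i + block]
        amax = max(abs(v) for v in chunk)
        # Map the block's largest magnitude onto 6.0, the largest E2M1 value.
        # Assumption: kept as a float here; NVFP4 stores this scale as FP8 E4M3.
        scale = amax / 6.0 if amax > 0 else 1.0
        codes = [round_to_e2m1(v / scale) for v in chunk]
        blocks.append((scale, codes))
    return blocks


def dequantize(blocks):
    """Expand (scale, codes) pairs back into real numbers."""
    return [scale * c for scale, codes in blocks for c in codes]


def dot_fp4(a, b):
    """Dot product of block-quantized vectors, accumulated in full precision."""
    qa = dequantize(quantize_blocks(a))
    qb = dequantize(quantize_blocks(b))
    acc = 0.0  # the running sum is a 64-bit Python float, not 4 bits
    for x, y in zip(qa, qb):
        acc += x * y
    return acc


def bits_per_element(block, scale_bits=8, elem_bits=4):
    """Storage cost per value once the shared scale is amortized over the block."""
    return elem_bits + scale_bits / block


print(bits_per_element(16))  # NVFP4-style: 4.5
print(bits_per_element(32))  # MXFP4-style: 4.25
```

The dequantize-then-multiply loop is only there for readability; the point is that a stored "FP4" tensor carries scales alongside its 4-bit codes (4.5 bits per value in this layout), and the sum is never held at 4 bits.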