i suspect that you could do fp8 with log tables and interpolation if you really wanted to (compared to the memory required for the model it's peanuts), it just turns into a LUT (log table look up) and bit shift (interpolation). so again, memory bandwidth is the limiting factor for transformers (as far as energy is concerned).
This time though LUT exists in a circuit, which is much more efficient than typical memory lookup. Such LUT would have to exist per each ALU though, so it can't be too large.