In the context of GMP, people write architecture-specific assembly for the inner loop anyway.
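For anyone who hasn't looked at what that inner loop does, here is a rough portable C sketch of limb-wise addition with carry propagation, the kind of loop that usually gets hand-written in assembly. This is only an illustration: `add_n` is a made-up name, the 64-bit limb size is an assumption, and it is not GMP's actual code (GMP's own routine for this job is `mpn_add_n`).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not GMP's implementation: adds two n-limb numbers
 * (least significant limb first), writes the n-limb sum to rp, and
 * returns the carry out of the top limb (0 or 1). */
static uint64_t add_n(uint64_t *rp, const uint64_t *ap,
                      const uint64_t *bp, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = ap[i] + bp[i];  /* may wrap around */
        uint64_t c1 = s < ap[i];      /* 1 if the limb add wrapped */
        uint64_t r  = s + carry;      /* add the incoming carry */
        uint64_t c2 = r < s;          /* 1 if adding the carry wrapped */
        rp[i] = r;
        carry = c1 | c2;              /* at most one of the two can be set */
    }
    return carry;
}
```

Without a carry flag, each limb needs the two extra comparisons to recover the carry. With an add-with-carry instruction, the same step is a single instruction per limb.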
Besides that, you raise good points on sources of complexity. I’m waiting for benchmarks once such developments have been incorporated; until then, everything else is guesswork.
If they didn't run those benchmarks (at least in simulation, like they benchmarked everything else) before releasing the spec, then the claim that this issue can be solved by op fusion is nothing but handwaving and wishful thinking. The reality is that they optimized for 1980s-style C programming without noticing that this isn't the 1980s anymore.