> I wonder if GCC has improved since then.
Yes, it has. I've written a lot of SIMD code and spent a good amount of time reading the compiler's assembly output, and there has been a huge improvement over the last decade.
GCC register allocation wasn't great, then it got better with x86 SSE but still sucked at ARM NEON, and now it seems to be decent with both.
Clang was better at SIMD code before GCC was. It was equally good with SSE and NEON.
In my experience, compilers are much better than humans at instruction scheduling. That matters most with portable vector extensions: you write the code once, instead of writing it twice and then tweaking the scheduling for every architecture separately.
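
For anyone who hasn't used them, here is a minimal sketch of what portable vector extensions look like, assuming GCC or Clang and their `vector_size` attribute. The names `v4f`, `saxpy4` and `saxpy4_selfcheck` are made up for illustration. The same source should compile to SSE on x86 and to NEON on ARM, with the compiler doing the register allocation and scheduling.

```c
#include <assert.h>

/* GCC/Clang vector extension: four floats packed into one 16-byte vector.
   This is a compiler extension, not standard C. */
typedef float v4f __attribute__((vector_size(16)));

/* Hypothetical helper: computes a*x + y on four lanes at once.
   No intrinsics, so nothing here is specific to SSE or NEON. */
static v4f saxpy4(float a, v4f x, v4f y)
{
    v4f va = {a, a, a, a};   /* broadcast the scalar to all lanes */
    return va * x + y;       /* element-wise multiply and add */
}

/* Small self-check: lanes can be read back with ordinary subscripts. */
static void saxpy4_selfcheck(void)
{
    v4f x = {1.0f, 2.0f, 3.0f, 4.0f};
    v4f y = {10.0f, 20.0f, 30.0f, 40.0f};
    v4f r = saxpy4(2.0f, x, y);

    assert(r[0] == 12.0f);
    assert(r[1] == 24.0f);
    assert(r[2] == 36.0f);
    assert(r[3] == 48.0f);
}
```

Calling `saxpy4_selfcheck()` from `main` and compiling with `-O2` for both targets is an easy way to compare the generated assembly.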