From a SIMD perspective, it’s worth noting that on most platforms, the cost of computing one square root or two is the same. On modern x86 server CPUs, for instance, you can calculate up to 8 double-precision roots in parallel with identical latency. So there’s no additional cost in terms of performance.
I hope this sheds some light on the design of my code.
PS: In a previous life, I did research in Astro- and Plasma Physics. While I don’t claim to remember all the Math, it’s usually more productive to ask for clarification than to assume ignorance ;)