It's not entirely surprising that a carefully optimized C program using explicit SSE intrinsics, plus a fancy trick involving a low-precision square root instruction fixed up with two iterations of Newton's method, would be fast. :-)
What impresses me is that the Rust version didn't do any of that stuff. It's just very boring, straightforward code -- and it got the same speed anyway. Some impressive compilation there!
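
For anyone curious what the square root trick looks like, here's a rough sketch in plain Rust. It is not the code from either benchmark entry. I'm assuming the trick is the usual one: take a cheap, roughly 12-bit estimate of 1/sqrt(x) (SSE's `rsqrtps` is one such instruction) and refine it with two Newton steps. The function names are made up, and `crude_rsqrt` only imitates a low-precision instruction by throwing away mantissa bits, so treat it as an illustration rather than something to benchmark.

```rust
/// Hypothetical stand-in for a low-precision hardware estimate:
/// compute 1/sqrt(x) in f32, then keep only about 12 mantissa bits.
fn crude_rsqrt(x: f64) -> f64 {
    let est = 1.0_f32 / (x as f32).sqrt();
    // Clear the low 11 of the 23 mantissa bits to simulate a coarse estimate.
    f32::from_bits(est.to_bits() & 0xFFFF_F800) as f64
}

/// One Newton step for f(y) = 1/y^2 - x.
/// Each step roughly squares the relative error of y.
fn newton_step(x: f64, y: f64) -> f64 {
    y * (1.5 - 0.5 * x * y * y)
}

/// The "fancy" version: crude estimate plus two Newton iterations.
fn refined_rsqrt(x: f64) -> f64 {
    let y = crude_rsqrt(x);
    newton_step(x, newton_step(x, y))
}

/// The "boring" version: just ask for the square root and divide.
fn plain_rsqrt(x: f64) -> f64 {
    1.0 / x.sqrt()
}

fn main() {
    for x in [0.5, 2.0, 3.0, 10.0, 12345.678] {
        let exact = plain_rsqrt(x);
        let rel_err = ((refined_rsqrt(x) - exact) / exact).abs();
        println!("x = {}: relative error after two steps = {:e}", x, rel_err);
    }
}
```

Starting from an error of about 2^-12, two steps land close to (but not quite at) full double precision, which is presumably why the hand-tuned version stops at two.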