If you're building a digital filter on an FPGA, you will usually optimize the structure by running a DIF FFT, which produces out-of-order results, followed by a DIT iFFT, which accepts the out-of-order data directly. The arithmetic irregularities the article mentions for the DIF and split-radix structures don't weigh the same when you control the hardware: the complex multiply is implemented with 3 multiplies and 5 adds, and twiddles are better computed on the fly than stored, which wastes transistors.
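To make that concrete, here is a rough software sketch of the structure, not FPGA code. It assumes radix-2 stages and a circular (block) filter, and the names `cmul3`, `dif_fft`, `dit_ifft` and `fast_circular_filter` are made up for illustration. The twiddles are computed rather than looked up, and no reordering pass appears anywhere because the pointwise product is taken directly in the out-of-order domain.

```python
import cmath

def cmul3(z, w):
    """Complex multiply with 3 real multiplies and 5 real adds."""
    a, b, c, d = z.real, z.imag, w.real, w.imag
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return complex(k1 - k3, k1 + k2)  # (ac - bd) + i(ad + bc)

def dif_fft(x):
    """Radix-2 DIF FFT: natural-order input, bit-reversed-order output."""
    x = [complex(v) for v in x]
    n = len(x)  # assumed to be a power of two
    half = n // 2
    while half >= 1:
        for start in range(0, n, 2 * half):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / (2 * half))  # computed twiddle
                a, b = x[start + j], x[start + j + half]
                x[start + j] = a + b
                x[start + j + half] = cmul3(a - b, w)
        half //= 2
    return x

def dit_ifft(x):
    """Radix-2 DIT inverse FFT: bit-reversed-order input, natural-order output."""
    x = [complex(v) for v in x]
    n = len(x)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for j in range(half):
                w = cmath.exp(2j * cmath.pi * j / (2 * half))
                a = x[start + j]
                b = cmul3(x[start + j + half], w)
                x[start + j] = a + b
                x[start + j + half] = a - b
        half *= 2
    return [v / n for v in x]

def fast_circular_filter(x, h):
    """Circular convolution of x and h with no bit-reversal pass at all."""
    X, H = dif_fft(x), dif_fft(h)          # both in the same scrambled order
    Y = [cmul3(p, q) for p, q in zip(X, H)]  # pointwise product is order-agnostic
    return dit_ifft(Y)
```

For example, `fast_circular_filter([1, 2, 3, 4, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0])` should match the direct circular convolution of the two sequences up to rounding. If the twiddle is fixed ahead of time, `d - c` and `c + d` can be precomputed, which drops the add count further.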
Using a real-to-complex FFT makes a significant difference to performance, and it's important to start with it because it places some additional constraints on the main FFT. In particular, the butterfly needed in the r2c and c2r passes isn't very amenable to working in bit-reversed order, so the trick of processing the frequency domain in bit-reversed order doesn't necessarily work. It's also important for comparison against the Fast Hartley Transform, which looks good performance-wise against a complex FFT but not against a real FFT.
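A minimal sketch of the usual r2c construction shows where the constraint comes from. This assumes the common approach of packing even/odd samples into a half-length complex transform; `naive_dft` stands in for the main FFT and `rfft_via_half` is an illustrative name. The post-pass combines bin `k` with bin `m - k`, and those two bins are not neighbours once the spectrum is in bit-reversed order.

```python
import cmath

def naive_dft(z):
    """O(n^2) reference DFT, standing in for the main complex FFT."""
    n = len(z)
    return [sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def rfft_via_half(x):
    """Real FFT of even-length x using one complex transform of length len(x)/2.

    Returns bins 0 .. n/2 inclusive; the rest follow from conjugate symmetry.
    """
    n = len(x)
    m = n // 2
    # Pack even samples into the real part and odd samples into the imaginary part.
    z = [complex(x[2 * t], x[2 * t + 1]) for t in range(m)]
    Z = naive_dft(z)
    out = []
    for k in range(m + 1):
        zk = Z[k % m]
        zmk = Z[(m - k) % m].conjugate()  # the mirrored bin: k pairs with m - k
        even = (zk + zmk) / 2             # spectrum of the even samples
        odd = (zk - zmk) / 2j             # spectrum of the odd samples
        out.append(even + cmath.exp(-2j * cmath.pi * k / n) * odd)
    return out
```

The result should agree with the first `n/2 + 1` bins of a full-length complex DFT of the same real input.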
I also found that radix-4 performed better than split-radix or conjugate-pair FFT with SSE2/AVX SIMD. Both the instruction flow and the data flow are cleaner, and the CPU has an easier time flooding the FMA units with simple loops than with the more chaotic data flow of SRFFT/CPFFT. An FMA-based radix-4 loop can easily keep the FMA units at >95% utilization.
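For reference, the regularity being described looks roughly like this in scalar form. It's only a sketch of a textbook radix-4 DIT step, not the SIMD/FMA code from the comment, and `butterfly4` and `fft_radix4` are illustrative names. Every step is the same four twiddled inputs feeding the same add/subtract pattern, with the multiply by -i reduced to a swap and a sign flip.

```python
import cmath

def butterfly4(a, b, c, d):
    """4-point DFT of (a, b, c, d) using only adds and one swap-style rotation."""
    t0, t1 = a + c, a - c
    t2 = b + d
    diff = b - d
    t3 = complex(diff.imag, -diff.real)  # -i * (b - d): swap parts, flip one sign
    return t0 + t2, t1 + t3, t0 - t2, t1 - t3

def fft_radix4(x):
    """Recursive radix-4 DIT FFT; len(x) is assumed to be a power of 4."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    q = n // 4
    subs = [fft_radix4(x[r::4]) for r in range(4)]  # four interleaved sub-FFTs
    out = [0j] * n
    for k in range(q):
        w1 = cmath.exp(-2j * cmath.pi * k / n)
        a = subs[0][k]
        b = subs[1][k] * w1
        c = subs[2][k] * w1 * w1
        d = subs[3][k] * w1 * w1 * w1
        out[k], out[k + q], out[k + 2 * q], out[k + 3 * q] = butterfly4(a, b, c, d)
    return out

def naive_dft(z):
    """O(n^2) reference DFT for checking the result."""
    n = len(z)
    return [sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

In a vectorized version each of `a`, `b`, `c`, `d` would be a register full of lanes, and the inner loop body stays identical across passes, which is presumably what makes it easy to keep the units busy.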
For data ordering, the vector-interleaved format mentioned in the article is indeed great for the main passes, but real/imag-interleaved data turns out to have some benefits for the smallest butterflies. What worked best in my case was to do the deinterleave/transpose as part of an initial radix-8 pass that also handled the bit reversal a cache line at a time.
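As a small illustration of the two layouts, assuming "vector-interleaved" means blocks of `width` real parts followed by `width` imaginary parts, where `width` plays the role of the SIMD lane count: the function names below are made up, and the sketch only shows the reordering itself, not the fused radix-8 pass or the cache-line-sized bit reversal.

```python
def to_vector_interleaved(buf, width):
    """[re0, im0, re1, im1, ...] -> blocks of `width` reals then `width` imags."""
    n = len(buf) // 2  # number of complex values
    assert n % width == 0, "complex count must be a multiple of the vector width"
    out = []
    for base in range(0, n, width):
        out.extend(buf[2 * (base + j)] for j in range(width))      # real parts
        out.extend(buf[2 * (base + j) + 1] for j in range(width))  # imaginary parts
    return out

def from_vector_interleaved(buf, width):
    """Inverse of to_vector_interleaved."""
    n = len(buf) // 2
    assert n % width == 0, "complex count must be a multiple of the vector width"
    out = []
    for base in range(0, 2 * n, 2 * width):
        for j in range(width):
            out.append(buf[base + j])          # real part of element j in block
            out.append(buf[base + width + j])  # matching imaginary part
    return out
```

With `width=2`, the buffer `[0, 10, 1, 11, 2, 12, 3, 13]` becomes `[0, 1, 10, 11, 2, 3, 12, 13]`, so a vector load picks up only real parts or only imaginary parts.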