Back in 2012 I did some building on ZSpace[0] and that setup still feels like the sweet spot: physical keyboard, physical pen (with virtual extension), physical glasses to detect head position for parallax, and a 3D environment to play and create in.
The _visual_ feedback from moving your head and rotating objects with the pen was extremely low-latency. Gesture detection is still nowhere near that level of fidelity, but with peripherals, perhaps it's not necessary.
[0] https://zspace.com/