We didn't actually try LSTMs, because we train on 1.25 second chunks, so running an LSTM for several hundred timesteps would drastically slow down training. Our per-iteration time was in the 200-500 millisecond range, and using an LSTM or GRU would likely bump that into the 1-3 second range, maybe more, whereas the QRNN conditioning actually made it 20-40% cheaper than the transposed-convolution conditioning.
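For context on why the QRNN is so cheap, here's a minimal NumPy sketch of its fo-pooling (Bradbury et al. 2016). Filter width 1 and the weight names are simplifications for illustration, not the actual conditioning network: the point is that all the expensive matrix math parallelizes over timesteps, leaving only a cheap elementwise recurrence to run sequentially.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def qrnn_fo_pool(x, Wz, Wf, Wo):
    """QRNN layer with fo-pooling.

    x: (T, d_in) input sequence; Wz/Wf/Wo: (d_in, d_hidden) weights.
    Filter width 1, so the "convolution" is a plain matmul.
    """
    # Gate computations run in parallel across every timestep.
    z = np.tanh(x @ Wz)   # candidate values
    f = sigmoid(x @ Wf)   # forget gates
    o = sigmoid(x @ Wo)   # output gates
    # Only this elementwise recurrence is sequential, and it has no matmuls.
    c = np.zeros(z.shape[1])
    h = np.empty_like(z)
    for t in range(z.shape[0]):
        c = f[t] * c + (1.0 - f[t]) * z[t]
        h[t] = o[t] * c
    return h

# Tiny usage example with illustrative shapes.
rng = np.random.default_rng(0)
T, d_in, d_h = 8, 4, 6
h = qrnn_fo_pool(rng.standard_normal((T, d_in)),
                 rng.standard_normal((d_in, d_h)),
                 rng.standard_normal((d_in, d_h)),
                 rng.standard_normal((d_in, d_h)))
print(h.shape)  # (8, 6)
```

Compare that to an LSTM, where every timestep requires its own matrix multiplies inside the sequential loop.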
The upsampling procedure is quite finicky, so we went through quite a few iterations there, but we didn't have to tune the hyperparameters of the QRNN itself very much. Once we implemented the QRNN in CUDA for TensorFlow and got it to train, it worked without too much trouble.
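For what it's worth, a transposed-convolution upsampler (the baseline conditioning mentioned above) is simple to write down but easy to get wrong. This toy NumPy version, with illustrative names and parameters rather than the actual implementation, hints at one common pitfall:

```python
import numpy as np

def conv1d_transpose(frames, kernel, stride):
    """Upsample a frame-rate signal by `stride` with a 1-D transposed convolution."""
    n, k = len(frames), len(kernel)
    out = np.zeros((n - 1) * stride + k)
    for t, v in enumerate(frames):
        # Each input frame scatters a scaled copy of the kernel into the output.
        out[t * stride : t * stride + k] += v * kernel
    return out

frames = np.arange(5, dtype=float)
up = conv1d_transpose(frames, np.ones(8), stride=4)
print(len(up))  # 24
```

When the kernel length isn't a multiple of the stride, some output samples receive more overlapping kernel taps than others, which produces the periodic "checkerboard" artifacts commonly blamed for finicky upsampling layers.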
Our collaborators in Beijing mentioned that bidirectional LSTMs worked in a similar way for them, though.