We're still fairly early in the project, so for now explicit Python threading is the only supported way to drive multiple GPUs.
We can do better, though, and we're working on ways to leverage the hardware more effectively. For example, if your model has no data-dependent control flow, we can enqueue kernels on all GPUs in your machine in parallel from a single Python thread, which performs much better than explicit Python multithreading.
Stay tuned as we release new experimental APIs for leveraging multiple GPUs and multiple machines.
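In the meantime, the one-thread-per-GPU approach described above might be sketched as follows. This is a hypothetical illustration, not this project's API: `run_model` is a stand-in for whatever per-device entry point your code exposes, and the body just simulates per-device work.

```python
import threading

# One slot per device; plain dict writes are safe here because each
# thread writes a distinct key and the GIL serializes the operation.
results = {}

def run_model(device_id, inputs):
    # Hypothetical stand-in: in real code this would move the model and
    # inputs to `device_id` and launch kernels there. We simulate the
    # per-device work with a placeholder computation.
    results[device_id] = sum(inputs) * device_id

num_gpus = 4
threads = [
    threading.Thread(target=run_model, args=(d, [1, 2, 3]))
    for d in range(num_gpus)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each device produced one result.
print(sorted(results))
```

Note that each thread spends most of its time waiting on the device, so the GIL is not usually the bottleneck here; the single-thread enqueue approach mentioned above mainly saves the per-thread launch overhead.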