In my approach, these would be two completely independent streams. I haven't implemented audio yet, but hypothetically you can continuously adjust the sample buffer size of the audio stream to stay within some safety margin of the detected peak latency, and things should self-synchronize pretty well.
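As a rough sketch of that idea (all names and the safety-margin policy are hypothetical, not from any real implementation): track recent arrival latencies, and size the playback buffer to cover the worst one seen recently plus a margin.

```python
import collections


class AdaptiveAudioBuffer:
    """Hypothetical jitter buffer whose target depth tracks recent peak latency."""

    def __init__(self, sample_rate=44100, safety_margin=1.25, window=200):
        self.sample_rate = sample_rate
        self.safety_margin = safety_margin
        # Sliding window of recently observed packet latencies, in seconds.
        self.latencies = collections.deque(maxlen=window)

    def observe_latency(self, seconds):
        self.latencies.append(seconds)

    def target_samples(self):
        # Buffer enough samples to ride out the worst latency seen recently,
        # plus a safety margin so a fresh spike doesn't cause an underrun.
        peak = max(self.latencies, default=0.020)  # fall back to 20 ms
        return int(peak * self.safety_margin * self.sample_rate)
```

With a 30 ms observed peak and a 1.25x margin, the target works out to about 37.5 ms of audio (0.030 × 1.25 × 44100 ≈ 1653 samples).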
In terms of encoding the audio, I don't know that I would. For video, going from MPEG->JPEG struck the perfect trade-off. For reducing audio latency, I think you would just need to send raw PCM samples as soon as you generate them, maybe in really small batches (in case you have a client super-close to the server and you want virtually zero latency). If you use small batches of samples you could probably start thinking about MP3, but raw 44.1 kHz 16-bit stereo audio is only about 1.41 Mbps. Most cellphones wouldn't have a problem with that these days.
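The bandwidth figure falls straight out of the stream parameters:

```python
# Raw PCM bandwidth = sample rate * bits per sample * channel count.
sample_rate = 44_100        # samples per second
bits_per_sample = 16
channels = 2                # stereo
bps = sample_rate * bits_per_sample * channels
print(bps)                  # 1411200 bits/s, i.e. ~1.41 Mbps
```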
Edit: The fundamental difference, in information-theory terms, between video and audio is the dimensionality. JPEG makes sense for video because the smallest useful unit of presentation is the individual video frame. For audio, the smallest useful unit of presentation is the PCM sample, but the hazard is that these arrive at a substantially higher rate (44,100/s) than video frames (60/s), so you need to buffer enough samples to cover the latency gap.
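To put numbers on that rate mismatch (the 50 ms latency budget here is an assumed figure for illustration):

```python
# PCM samples accumulate far faster than video frames, so the audio
# buffer must absorb many samples per unit of network latency.
sample_rate = 44_100
video_fps = 60

samples_per_video_frame = sample_rate / video_fps       # 735.0 samples
latency_ms = 50                                          # assumed jitter budget
buffer_samples = sample_rate * latency_ms // 1000        # 2205 samples

print(samples_per_video_frame, buffer_samples)
```

So covering just three video frames' worth of latency already means holding a couple thousand samples in flight.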