Audio processing is real-time, which means that you cannot miss your deadline. If you do miss you get audible glitches, whereas in graphics you just get a slowdown. For that reason, audio code is written in a very particular, real-time safe style that avoids locks, allocations, syscalls, and anything else that is not guaranteed to return within a bounded amount time.
How long the deadline is depends on your buffer size and sample rate. To my ear, buffer sizes of >128 samples (at a sample rate of 44.1 KHz) have detectable latency (although the amount of latency will depend also on how many applications are in your signal chain). At 128 samples you have just under 3ms to do your processing.
Also note that for graphics, the output is the GPU itself. So you don't need to wait for the output to move back to the CPU, it's already where it needs to be.