Coroutines / Goroutines and the like are probably better for I/O-bound tasks, where the CPU cost of OS-level task switching would otherwise be significant.
--------
For example: matrix multiplication is better served by a thread pool. Handling 1000 simultaneous connections when you get Slashdotted (or the "Hacker News hug of death") is better solved with coroutines.
Coroutines MIGHT be more efficient if what you end up building is a state machine anyway (that's what the compiler turns most of these coroutines into). Otherwise, if it's just pure parallel CPU/memory burning with few state transitions/dependencies, then a dedicated thread pool sized to roughly the number of CPU cores on the box will be the most efficient.
Heck, it can often even pay to "pin" certain tasks to a thread to keep the CPU cache filled with relevant data. For example, 4 threads each handling one of the 4 quadrants of the matrix, rather than having the next available thread pick up the next task.
It's I/O to send data to and from a GPU, so it's somewhat an I/O-bound task. But there's also a significant amount of CPU work involved. Ideally, you want to balance CPU work and GPU work to maximize the total work being done.
Fortunately, CUDA streams seem like they'd mesh pretty well with coroutines (if enough code were there to support them). But if you're reaching for the "GPU button", everything is compute-bound (if not, you're "doing it wrong"). So now you have the question of "how much to oversubscribe?"
Then again, that's why you just make the oversubscription factor a #define and then test a lot to find the right value... EDIT: Or maybe you oversubscribe until the GPU / CPU runs out of VRAM / RAM. Oversubscription isn't really an issue with coroutines executed inside a thread pool: you aren't spending any CPU time needlessly task-switching.
For a lot of the programming I do (and, I'm sure, for a lot of others on HN), IO is almost all network IO. For that, because it's so slow and everything works over DMA anyway, coroutines end up working really well.
However, once you start talking about local resources such as SSDs or the GPU, it gets trickier. As you rightly point out, the GPU is especially bad because all GPU communication ends up being routed through the CPU. At least for a hard disk, there's DMA, which cuts down on the amount of CPU work needed to access a bit of data.
The big headaches with stackful coroutine-based user-mode threading come from two sources. One is allocating the stack. If your language requires a contiguous stack, then you either make the stacks small and risk running out, or make them big, which can be a problem on 32-bit platforms (you can run out of address space) or on platforms with strict commit-charge-based memory accounting. Both can be mitigated by allowing non-contiguous stacks or relocatable contiguous stacks (so small stacks can grow later without headaches), although obviously that has performance considerations.
The other stackful coroutine headache is calling into code from another language (i.e. FFI), which can make direct blocking system calls and end up starving you of OS threads.
I do agree that for purely CPU- or memory-bound applications a classical thread pool makes more sense. The main advantages of either type of coroutine-based user-mode threading primarily apply to IO-heavy or mixed workloads.