> First: if you have an epoll loop it is also the cost of the thread context switch, which has definitely hurt us in RPC systems using kernel threads. By contrast the goroutine gets scheduled onto the kernel thread that answered the poll, saving the switch.
I'm not comparing M:N to a 1:1 system where all I/O is proxied out to another thread sitting in an epoll loop. I'm comparing M:N to 1:1 with blocking I/O. In this scenario, the kernel switches directly onto the appropriate thread.
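To make the model I mean concrete, here is a rough sketch of 1:1 with blocking I/O. Python is used only for brevity (its `threading.Thread` maps to a kernel thread); `handle` and `serve_once` are made-up names, and `socketpair()` stands in for an accepted connection. It illustrates the shape, it is not a benchmark.

```python
import socket
import threading

def handle(conn):
    # Blocking read: this kernel thread sleeps inside recv() until data
    # arrives, and the kernel wakes it directly. No poller thread sits
    # in between to hand the work over.
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())

def serve_once(payload):
    # socketpair() is a stand-in for a listening socket plus accept().
    server_side, client_side = socket.socketpair()
    worker = threading.Thread(target=handle, args=(server_side,))
    worker.start()  # one kernel thread per connection (1:1)
    with client_side:
        client_side.sendall(payload)
        reply = client_side.recv(1024)
    worker.join()
    return reply

if __name__ == "__main__":
    print(serve_once(b"ping"))
```

The thread that blocks in `recv()` is the same thread that processes the data, which is the "kernel switches directly onto the appropriate thread" case.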
> Second: as I alluded to earlier, linux and solaris can scale their kernel thread implementations, not all OSs can.
The vast majority of Go users are running Linux. And on Windows, UMS (User-Mode Scheduling) is 1:1 and is the preferred way to build high-performance servers; it avoids a lot of the problems that Go has (for instance, not playing nicely with third-party code).
> Third: you can only adjust stack sizes down if you know your program always keeps its stacks small.
You could do 1:1 with stack growth just as Go does. As I've said before, small stacks are a property of the relocatable GC, not a property of the thread implementation.
> If all this were as easy as you say, we would still write nearly all our C/C++ servers using threads.
We don't write C/C++ servers using threads because (1) stackless use of epoll is faster than both 1:1 threading and M:N threading, as this project shows; and (2) C/C++ can't do relocatable stacks, as the language is hostile to precise moving GC.
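For anyone unsure what "stackless use of epoll" looks like, a minimal sketch follows, again in Python for brevity. `on_readable` and `run_demo` are hypothetical names, and `selectors.DefaultSelector` picks epoll on Linux. The point is that no connection owns a stack: each event runs a short callback to completion and all per-connection state lives on the heap.

```python
import selectors
import socket

def on_readable(conn, sel, log):
    # Runs to completion on the loop's single stack; nothing is suspended.
    data = conn.recv(1024)
    if data:
        conn.sendall(data.upper())
        log.append(data)
    else:
        # Empty read means the peer closed the connection.
        sel.unregister(conn)
        conn.close()

def turn(sel, log):
    # One turn of the event loop: dispatch every ready callback.
    for key, _ in sel.select(timeout=1):
        key.data(key.fileobj, sel, log)

def run_demo(payload):
    sel = selectors.DefaultSelector()  # epoll-backed on Linux
    server_side, client_side = socket.socketpair()
    server_side.setblocking(False)
    log = []
    sel.register(server_side, selectors.EVENT_READ, on_readable)
    client_side.sendall(payload)
    turn(sel, log)   # handles the payload
    reply = client_side.recv(1024)
    client_side.close()
    turn(sel, log)   # handles EOF and cleans up
    sel.close()
    return reply, log

if __name__ == "__main__":
    print(run_demo(b"ping"))
```

There is no thread switch and no stack to allocate or grow per connection, which is where the speed advantage over both threading models comes from.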