By setting the thread stack size to a reasonable value. That's it. And, in fact, on 64-bit you often don't even need to do that.
The difference you're describing is a difference in default thread stack sizes, which is hardly a paradigm shift. We're talking about one call to pthread_attr_setstacksize().
First: if you have an epoll loop, there is also the cost of the thread context switch, which has definitely hurt us in RPC systems using kernel threads. By contrast, the goroutine gets scheduled onto the kernel thread that answered the poll, saving the switch.
Second: as I alluded to earlier, linux and solaris can scale their kernel thread implementations, not all OSs can. My experiences with large numbers of threads on the BSDs and Windows (in years past admittedly) suggest other kernels don't have thread implementations designed to scale to such high numbers. Solving the problem in userspace means Go programs written in this style are portable across operating systems.
Third: you can only adjust stack sizes down if you know your program always keeps its stacks small. If you depend on libraries you don't own in C/C++, that's a difficult assumption. Go grows the stacks, so if you hit some corner case where a small number of goroutines need some significant amount of stack, your program uses more memory, but typically keeps working. No need for careful (manual!) stack accounting.
If all this were as easy as you say, we would still write nearly all our C/C++ servers using threads. We don't because it's not.
I'm not comparing M:N to a 1:1 system where all I/O is proxied out to another thread sitting in an epoll loop. I'm comparing M:N to 1:1 with blocking I/O. In this scenario, the kernel switches directly onto the appropriate thread.
> Second: as I alluded to earlier, linux and solaris can scale their kernel thread implementations, not all OSs can.
The vast majority of Go users are running Linux. And on Windows, UMS is 1:1 and is the preferred way to do high-performance servers; it avoids a lot of the problems that Go has (for instance, playing nicely with third-party code).
> Third: you can only adjust stack sizes down if you know your program always keeps its stacks small.
You could do 1:1 with stack growth just as Go does. As I've said before, small stacks are a property of the relocatable GC, not a property of the thread implementation.
> If all this were as easy as you say, we would still write nearly all our C/C++ servers using threads.
We don't write C/C++ servers using threads because (1) stackless use of epoll is faster than both 1:1 threading and M:N threading, as this project shows; (2) C/C++ can't do relocatable stacks, as the language is hostile to precise moving GC.
Second, almost all the event-driven C++ servers I have seen are written that way not for performance, but for scaling and latency. There is usually plenty of spare CPU and RAM; only a tiny fraction of them really bump up against resource limits. (A typical case of the vast majority of code not being performance sensitive.)
Otherwise, I agree with your points in this comment. Especially the broader point that there's no novel component of Go. Go is about combining well-known things together.
However, it seems to me that Go still cuts through the "threads vs. events" argument in a way nothing else does. I can write code in a blocking style using typical libraries, and have it scale to large numbers of active connections.
On other systems, either the thread implementations don't scale, or I have to heavily restrict library use based on stack growth, or I am tied to a particular OS. It seems to me the only alternatives to Go's nice blocking code environment require significant compromise or require something to be built.