I call goroutines threads because they are user-level threads.
As an analogy, NVIDIA calls local threadgroups "warps", but that doesn't make them not local threadgroups.
> Creating and destroying kernel threads is significantly more expensive.
That's because kernel threads usually have larger stacks. But they don't have to: the stack size is configurable. Other than stack size, the primary difference is simply that kernel threads are created in kernel space and user threads are created in userspace.
> A kernel thread has a fixed stack and if you go beyond, you crash. Which means that you have to create kernel threads with worst-case-scenario stack sizes (and pray that you got it right).
You can do stack switching in 1:1 too. After all, if you couldn't, then Go couldn't do stack switching at all, since goroutines are built on top of kernel threads.
Go's small stacks are really a property of the moving GC, not a property of the threading model.
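To make the small-stack point concrete, here is a minimal sketch of a goroutine outgrowing its initial stack. As far as I know, goroutines in current Go releases start with a stack of a few kilobytes, and the runtime copies it to a larger allocation on demand. The names `depth` and `growInGoroutine` are hypothetical helpers for illustration, and the frame size is only approximate.

```go
package main

import "fmt"

// depth recurses n levels and keeps a 256-byte buffer live in each frame,
// so the total stack needed (roughly n * 256 bytes plus frame overhead)
// is far larger than a goroutine's small initial stack.
func depth(n int) int {
	var pad [256]byte
	pad[n%len(pad)] = 1
	if n == 0 {
		return 0
	}
	// Reading pad after the recursive call keeps the buffer live;
	// each level contributes exactly 1 to the result.
	return depth(n-1) + int(pad[n%len(pad)])
}

// growInGoroutine runs the deep recursion on a fresh goroutine, whose
// stack must therefore grow while the recursion is in progress.
func growInGoroutine(n int) int {
	done := make(chan int)
	go func() { done <- depth(n) }()
	return <-done
}

func main() {
	// Assumption: 100000 levels stays well under the runtime's maximum stack size.
	fmt.Println(growInGoroutine(100000))
}
```

Nothing in the program declares a stack size anywhere; the growth is entirely the runtime's doing.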
> In a 4 core CPU there is no point in running more than 4 busy kernel threads but kernel scheduler has to give each thread a chance to run.
> Go runtime only creates as many threads as CPUs and avoids this waste.
Not if they're blocked doing I/O!
If they're not blocked doing I/O, then Go tries to do preemption just as the kernel does. (I say "tries to" because Go currently cannot preempt outside function boundaries; this is a significant downside of M:N threading compared to 1:1 kernel threading.)
> That's why high-perf servers (like nginx) don't just use kernel thread per connection and go through considerable complexity of writing event driven code.
High-performance servers like nginx use an event loop because it's the only way to get the absolute fastest performance, with no overhead of stacks at all. In fact, the project described in the article getting better performance than Go's threads is evidence of exactly that.
It would be possible, and interesting, to do Go-like 1:1 threading with small stacks.
> Go gives you straightforward programming model of thread-per-connection with scalability and performance much closer to event-driven model.
Sure. But that's mostly because of the GC, not because of the M:N threading model.
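For reference, the thread-per-connection style under discussion looks roughly like this in Go. This is a minimal sketch rather than anyone's production server: `serve` and `echoOnce` are hypothetical names, and the loopback address with an ephemeral port is an assumption made so the example is self-contained.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
)

// serve accepts connections and hands each one to its own goroutine,
// which then uses plain blocking-style reads and writes.
func serve(ln net.Listener) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return // listener closed
		}
		go func(c net.Conn) {
			defer c.Close()
			io.Copy(c, c) // echo everything back until the peer closes
		}(conn)
	}
}

// echoOnce starts a server on an ephemeral loopback port, sends one line,
// and returns the echoed line without its trailing newline.
func echoOnce(msg string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()
	go serve(ln)

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()

	if _, err := fmt.Fprintln(conn, msg); err != nil {
		return "", err
	}
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return line[:len(line)-1], nil
}

func main() {
	reply, err := echoOnce("ping")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(reply)
}
```

The handler reads like ordinary blocking code; whether it is cheap depends on how small each goroutine's stack can stay, which is the point being argued above.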
> Which is why it amazes me the lengths to which you go to denigrate Go in that respect and minimize what is a great and unique programming model among mainstream languages.
It's not unique. As I said, NGPT used to do M:N for pthreads. Solaris used to do M:N for pthreads. The JVM used to do M:N.