Myth. Performance won't be better. Scalability arguably is, but most use cases don't require the level of scaling at which async is superior to OS threads.
E.g. go ahead and implement an RPC server which only has to deal with 10 concurrent requests, then measure latencies. The synchronous version might be faster, since it doesn't require any epoll calls. The difference might get even bigger if the server is serving static files and you are measuring throughput: the synchronous version will likely provide higher performance, since no extra context switch from the async-runtime-of-your-choice to the threadpool-for-file-io thread and back is required.
You are also right that once one moves beyond a certain scale, the async version might offer better performance. But the required scale differs per application, and not every application needs it.
It certainly isn't the case that by using a green-thread model you unconditionally throw away a 5x performance factor or something.
There are absolutely cases where that does matter. To name just one, a game engine would not want to throw away that level of performance out of the box. (That's the game engine user's job, to "spend" the quality of the game engine on their task.) But I think there's a lot more programmers who have, without analysis, assumed they're in that class and made a lot of decisions based on that, when in fact they are plural orders of magnitude away from it. To pick a number out of thin air, 4 full CPU cores running Rust code that someone has at least glanced at and spent a bit of time optimizing is a loooooot of power.
(The closest current comparison is Rust vs. Go, but Rust works much harder at compile-time optimization and doesn't have GC, and I expect those two things account for the majority of the delta between them, with Go's green threading being a non-trivial but clearly minority factor. Stay tuned for Java with Project Loom versus Rust, which has its own rather major differences but will at least be another relevant data point.)
Edit: also this only tests 500, not 500000.
Also, when doing threaded I/O, as soon as you want to support bidirectional traffic you have to implement select/poll/etc., since you can't do a blocking read and a blocking write at the same time on one thread. At that point you're already giving up a lot of the advantages of threads.
FWIW, there's an effort to do exactly that, but because it requires language-level changes and is still at the drawing-board stage, it will likely be a while before it can be widely used.
The "optionality" of `async` while sharing code also applies to `const` and to mutability (why do we need both `Deref` and `DerefMut`?). Finding a solution that works for all three (and maybe other?) parts of the language would be a welcome improvement.
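To make the `Deref`/`DerefMut` point concrete, here is a hypothetical wrapper type (`MyBox` is invented for illustration). The shared and mutable deref logic is identical except for mutability, yet it must be written as two separate trait implementations:

```rust
use std::ops::{Deref, DerefMut};

// Hypothetical smart-pointer-like wrapper.
struct MyBox<T>(T);

// Shared access: &self -> &T.
impl<T> Deref for MyBox<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.0
    }
}

// Mutable access: &mut self -> &mut T. Same body modulo `mut`,
// but there is no way to abstract over that today.
impl<T> DerefMut for MyBox<T> {
    fn deref_mut(&mut self) -> &mut T {
        &mut self.0
    }
}
```

This duplication over mutability is the same shape of problem as writing `fn`/`async fn` or `fn`/`const fn` pairs, which is what a unified "effect optionality" feature would address.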
Rust async code can be a bit challenging until you get it, but I can't think of a way to make it much simpler without sacrificing the whole "systems programming language" concept or support for embedded. The only good alternative is Go-like fibers, and that requires a fat runtime.
We use both Rust and Go at ZeroTier and find that they both have their own niches. (We are slowly moving ZeroTier from C++ to Rust to use a more modern and, more importantly, safer language.)
Could you link where?