See: Why User-Mode Threads Are Good for Performance https://youtu.be/07V08SB1l8c
Also - millions of Java programmers thank you for not going to async/await. What an evil source-code virus it is (among other things).
I tried to watch it at 1.25x speed as I normally do, but you already talk at 1.25x speed, so no need!
Also, both throughput and latency are performance metrics.
Reading that, it also makes me wonder what happens for disk I/O. Many other runtimes, both "green thread" ones like Golang and asynchronous ones like libuv/tokio, offload these kernel syscalls to a blocking thread pool (static or elastic) because, from what I've read, those syscalls are not easily made non-blocking the way, e.g., epoll-driven network I/O is. Do Java virtual threads do the same, or does disk I/O block the carrier threads? Out of curiosity, do Java's file APIs use io_uring on Linux when it is available? It is a fairly recently added kernel API for achieving truly non-blocking I/O, including disk I/O. It doesn't seem to bring much over epoll in terms of performance, but it has been a boon for disk I/O and in general can reduce context switches with the kernel by reducing the number of syscalls needed.
[1]: https://inside.java/2021/05/10/networking-io-with-virtual-th...
Yes.
> Do Java virtual threads do the same, or does disk I/O block the carrier threads? Out of curiosity, do Java's file APIs use io_uring on Linux when it is available?
We're working on using io_uring where available, especially for filesystem I/O. For now, filesystem I/O blocks OS threads, but we temporarily compensate by increasing the size of the scheduler's worker thread pool.
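To make the behaviour concrete, here is a minimal sketch (assuming Java 21+; the file name and task count are made up for illustration). Each `Files.readAllBytes` call blocks the virtual thread's carrier OS thread for the duration of the filesystem I/O, and the scheduler compensates by temporarily growing its worker pool:

```java
import java.nio.file.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.LongAdder;

public class VirtualFileIo {
    public static void main(String[] args) throws Exception {
        // Hypothetical demo file; a real workload would read existing data.
        Path tmp = Files.createTempFile("vtio", ".dat");
        Files.write(tmp, new byte[4096]);

        LongAdder bytesRead = new LongAdder();
        try (var pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 100; i++) {
                pool.submit(() -> {
                    try {
                        // Filesystem I/O here blocks the carrier thread;
                        // the scheduler grows its worker pool to compensate.
                        bytesRead.add(Files.readAllBytes(tmp).length);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
            }
        } // close() waits for all submitted tasks to finish
        System.out.println(bytesRead.sum()); // 100 tasks x 4096 bytes
        Files.delete(tmp);
    }
}
```

Network I/O, by contrast, unmounts the virtual thread from its carrier instead of blocking it, which is why the compensation described above applies specifically to filesystem I/O.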
That measurement told me that io_uring isn't necessary for good disk I/O performance, at least for some workloads.
It found no improvement in performance from io_uring compared with a dynamic thread pool that tries to keep enough I/O-blocked threads in flight to keep the various kernel and device queues busy.
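The offload pattern being compared against io_uring can be sketched as follows (a simplified illustration, assuming Java NIO; the pool type, file, and read position are made up). Blocking positional reads, analogous to pread(2), are submitted to an elastic pool of OS threads that grows while reads are blocked and shrinks when idle:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;
import java.util.concurrent.*;

public class BlockingReadPool {
    public static void main(String[] args) throws Exception {
        // Hypothetical demo file standing in for a real data file.
        Path tmp = Files.createTempFile("pool", ".dat");
        Files.write(tmp, new byte[8192]);

        // Elastic pool: spawns threads while reads block, reaps idle ones.
        ExecutorService ioPool = Executors.newCachedThreadPool();
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            Future<Integer> f = ioPool.submit(() -> {
                ByteBuffer buf = ByteBuffer.allocate(4096);
                // Positional read, like pread(2): no shared file offset,
                // so many threads can read the same channel concurrently.
                return ch.read(buf, 4096);
            });
            System.out.println(f.get()); // bytes read at offset 4096
        }
        ioPool.shutdown();
        Files.delete(tmp);
    }
}
```

A real implementation would bound the pool and tune its size to keep the kernel and device queues busy, which is the part the comment credits for the good results.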
This was a little surprising, because the read-syscall overhead when using threads was measurable. preadv2() was, surprisingly, much slower than pread(), so I used the latter. I used CLONE_IO and very small stacks for the I/O threads (less than a page; about 1 KiB IIRC), but performance was good even with plain pthreads and none of those thread optimisations. Probably I had good thread-pool and queue logic; it surprised me that the result was much faster than "fio" benchmark results had led me to expect.
In principle, io_uring should be a little more robust to different scenarios with competing processes, compared with blocking I/O threads, because it has access to kernel scheduling in a way that userspace does not. I also expect io_uring to get a little faster with time, compared with the kernel I tested on.
However, on Linux, OS threads* have been the fastest way to do filesystem and block-device I/O for a long time. (* Except for CLONE_IO not being set by default, but that flag is ignored in most configurations in current kernels.)