See: Why User-Mode Threads Are Good for Performance https://youtu.be/07V08SB1l8c
Also - millions of Java programmers thank you for not going to async/await. What an evil source-code virus it is (among other things).
I tried to watch it at 1.25x speed as I normally do, but you already talk at 1.25x speed, so no need!
Also, both throughput and latency are performance metrics.
Reading that, it also makes me wonder what happens for disk I/O. Many other runtimes, both "green thread" ones like Golang and asynchronous ones like libuv/tokio, offload these kernel syscalls to a blocking thread pool (static or elastic) because, from what I've read, those syscalls are not easily made non-blocking the way, e.g., epoll-driven network I/O is. Do Java virtual threads do the same, or does disk I/O block the carrier threads? Out of curiosity, do Java's file APIs use io_uring on Linux when it is available? It is a fairly recently added kernel API for achieving truly non-blocking I/O, including disk I/O. It doesn't seem to bring much over epoll in terms of performance, but it has been a boon for disk I/O and in general can reduce context switches with the kernel by reducing the number of syscalls needed.
[1]: https://inside.java/2021/05/10/networking-io-with-virtual-th...
Yes.
> Do Java virtual threads do the same, or does disk I/O block the carrier threads? Out of curiosity, do Java's file APIs use io_uring on Linux when it is available?
We're working on using io_uring where available, especially for filesystem I/O. For now, filesystem I/O blocks OS threads, but we temporarily compensate by increasing the size of the scheduler's worker thread pool.
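To make the behaviour concrete, here is a minimal sketch (assuming Java 21+; the file name and task count are made up for illustration). Each `Files.readAllBytes` call blocks the virtual thread's carrier OS thread for the duration of the filesystem I/O, and the scheduler compensates by temporarily growing its worker pool:

```java
import java.nio.file.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.LongAdder;

public class VirtualFileIo {
    public static void main(String[] args) throws Exception {
        // Hypothetical demo file; a real workload would read existing data.
        Path tmp = Files.createTempFile("vtio", ".dat");
        Files.write(tmp, new byte[4096]);

        LongAdder bytesRead = new LongAdder();
        try (var pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 100; i++) {
                pool.submit(() -> {
                    try {
                        // Filesystem I/O here blocks the carrier thread;
                        // the scheduler grows its worker pool to compensate.
                        bytesRead.add(Files.readAllBytes(tmp).length);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
            }
        } // close() waits for all submitted tasks to finish
        System.out.println(bytesRead.sum()); // 100 tasks x 4096 bytes
        Files.delete(tmp);
    }
}
```

Network I/O, by contrast, unmounts the virtual thread from its carrier instead of blocking it, which is why the compensation described above applies specifically to filesystem I/O.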
That measurement told me that io_uring isn't necessary for good disk I/O performance, at least for some workloads.
It found no improvement in performance from io_uring compared with a dynamic thread pool that tries to keep enough I/O-blocked threads in flight to keep the various kernel and device queues busy.
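The offload pattern being compared against io_uring can be sketched as follows (a simplified illustration, assuming Java NIO; the pool type, file, and read position are made up). Blocking positional reads, analogous to pread(2), are submitted to an elastic pool of OS threads that grows while reads are blocked and shrinks when idle:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;
import java.util.concurrent.*;

public class BlockingReadPool {
    public static void main(String[] args) throws Exception {
        // Hypothetical demo file standing in for a real data file.
        Path tmp = Files.createTempFile("pool", ".dat");
        Files.write(tmp, new byte[8192]);

        // Elastic pool: spawns threads while reads block, reaps idle ones.
        ExecutorService ioPool = Executors.newCachedThreadPool();
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            Future<Integer> f = ioPool.submit(() -> {
                ByteBuffer buf = ByteBuffer.allocate(4096);
                // Positional read, like pread(2): no shared file offset,
                // so many threads can read the same channel concurrently.
                return ch.read(buf, 4096);
            });
            System.out.println(f.get()); // bytes read at offset 4096
        }
        ioPool.shutdown();
        Files.delete(tmp);
    }
}
```

A real implementation would bound the pool and tune its size to keep the kernel and device queues busy, which is the part the comment credits for the good results.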
This was a little surprising, because the read-syscall overhead when using threads was measurable. preadv2() was, surprisingly, much slower than pread(), so I used the latter. I used CLONE_IO and very small stacks for the I/O threads (less than a page; about 1 KiB IIRC), but performance was good even with plain pthreads and none of those thread optimisations. Probably I had good thread-pool and queue logic; it surprised me that the result was much faster than "fio" benchmark results had led me to expect.
In principle, io_uring should be a little more robust to different scenarios with competing processes, compared with blocking I/O threads, because it has access to kernel scheduling in a way that userspace does not. I also expect io_uring to get a little faster with time, compared with the kernel I tested on.
However, on Linux, OS threads* have been the fastest way to do filesystem and block-device I/O for a long time. (* Except for CLONE_IO not being set by default, but that flag is ignored in most configurations in current kernels.)