> The io-uring crate doesn’t help much with this. The API doesn’t allow the borrow checker to protect you at compile time, and I don’t see it doing any runtime checks either.
I've seen comments like this before[1], and I get the impression that building a safe async Rust library around io_uring is actually quite difficult. Which is sort of a bummer.
IIRC Alice from the tokio team also suggested there hasn't been much interest in pushing through these difficulties more recently, as the current performance is "good enough".
Think about it for a second. Why do we not have this problem with "synchronous" syscalls? When you call `read` you also "pass a mutable borrow" of the buffer to the kernel, but it maps well onto the Rust ownership/borrow model, since the syscall blocks execution of the thread and there is no way for user code to prevent that. With a poll-based async model you side-step these issues, since you use the same "sync" syscalls, just ones that are guaranteed to return without blocking.
For completion-based IO to work properly with the ownership/borrow model, we have to guarantee that the task code will not continue execution until it receives a completion event. You simply cannot do that with state machines polled in user code. But the threading model fits here perfectly! If we replace threads with "green" threads, user Rust code will look indistinguishable from "synchronous" code. And no, green threads are not a problem for embedded systems; the model can work properly there, as demonstrated by many RTOSes.
There are several ways this could have been done without making an async runtime mandatory for all targets (the main reason green threads were removed before Rust 1.0). My personal favorite is the introduction of separate "async" targets.
Unfortunately, the Rust language developers made a bet on the unproven stackless polling model because of its promised efficiency, and we are in the process of finding out whether the bet pays off or not.
That's not really true. The only guarantees for Rust futures are that they get poll()ed once and that their Waker's wake() must be called before they are polled again. A completion-based future submits the request on first poll and calls wake() on completion. That's kind of the interesting design of futures in Rust: they support both polling and completion.
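To make that concrete, here is a minimal, self-contained sketch of a completion-style future (not io_uring; a plain thread stands in for the completion source, and the value 42 is arbitrary): the first poll "submits" the operation and returns Pending, and the completion side calls wake() so the executor polls it again.

    use std::future::Future;
    use std::pin::Pin;
    use std::sync::{Arc, Mutex};
    use std::task::{Context, Poll, Waker};
    use std::thread;

    // State shared between the future and whatever signals completion.
    struct Shared {
        result: Option<u64>,
        waker: Option<Waker>,
    }

    struct CompletionFuture {
        shared: Option<Arc<Mutex<Shared>>>, // None until the first poll "submits"
    }

    impl Future for CompletionFuture {
        type Output = u64;

        fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
            if self.shared.is_none() {
                // First poll: submit the operation and remember the waker.
                let shared = Arc::new(Mutex::new(Shared {
                    result: None,
                    waker: Some(cx.waker().clone()),
                }));
                let completer = Arc::clone(&shared);
                // A thread plays the role of the kernel posting a completion event.
                thread::spawn(move || {
                    let mut s = completer.lock().unwrap();
                    s.result = Some(42);
                    if let Some(w) = s.waker.take() {
                        w.wake(); // tell the executor to poll us again
                    }
                });
                self.shared = Some(shared);
                return Poll::Pending;
            }
            let shared = self.shared.as_ref().unwrap().clone();
            let mut s = shared.lock().unwrap();
            match s.result.take() {
                Some(v) => Poll::Ready(v),
                None => {
                    // Spurious poll before completion: refresh the waker.
                    s.waker = Some(cx.waker().clone());
                    Poll::Pending
                }
            }
        }
    }

Driving this with any executor (e.g. futures::executor::block_on) yields 42; the point is just that submit-on-first-poll plus wake-on-completion fits the Future contract without any readiness polling.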
The real conundrum is that futures are not really portable across executors. With io_uring, for example, the executor's event loop is tightly coupled with submission and completion. And due to the instability of a few features (async trait, return-position impl trait in trait, etc.) there is not really a standard way to write executor-independent async code (you can, and some big crates do, but it's not necessarily trivial).
Combine that with the fact that container runtimes disable io_uring by default while most people are deploying async web servers in Docker containers, and it's easy to see why development has stalled.
It's also unfair to judge the design goals and ideas of 2016 against how the ecosystem evolved over the last decade, particularly after futures were stabilized before other language items and the major executors became popular. If you look at the RFCs and blog posts from back then (eg: https://aturon.github.io/tech/2016/09/07/futures-design/) you can see why readiness was chosen over completion, and how completion can be represented with readiness. Turon even calls out how naïve completion (callbacks) leads to more allocation on future composition, and points to where green threads were abandoned.
No, this is a mistaken retelling of history. The Rust developers were not ignorant of IOCP, nor were they zealous about any specific async model. They went looking for a model that fit with Rust's ethos, and completion didn't fit. Aaron Turon has an illuminating post from 2016 explaining their reasoning: https://aturon.github.io/tech/2016/09/07/futures-design/
See the section "Defining futures":
> There’s a very standard way to describe futures, which we found in every existing futures implementation we inspected: as a function that subscribes a callback for notification that the future is complete.
> Note: In the async I/O world, this kind of interface is sometimes referred to as completion-based, because events are signaled on completion of operations; Windows’s IOCP is based on this model.
> [...] Unfortunately, this approach nevertheless forces allocation at almost every point of future composition, and often imposes dynamic dispatch, despite our best efforts to avoid such overhead.
> [...] TL;DR, we were unable to make the “standard” future abstraction provide zero-cost composition of futures, and we know of no “standard” implementation that does so.
> [...] After much soul-searching, we arrived at a new “demand-driven” definition of futures.
I'm not sure where this meme came from where people seem to think that the Rust devs rejected a completion-based scheme because of some emotional affinity for epoll. They spent a long time thinking about the problem, and came up with a solution that worked best for Rust's goals. The existence of a usable io_uring in 2016 wouldn't have changed the fundamental calculus.
&mut references are exclusive and non-copyable, so the hot potato approach can even be used within their scope.
But the problem in Rust is that threads can unwind/exit at any time, invalidating buffers living on the stack, and io_uring may use the buffer for longer than the thread lives.
The borrow checker only checks what code does; it has no power to alter runtime behavior (it's not a GC, after all). So it can prevent io_uring abstractions from accepting any on-stack buffers, but it has no power to stop threads from unwinding in order to make an on-stack buffer safe instead.
Well, I think there is interest, but mostly for file IO.
For file IO, the situation is pretty simple. We already have to implement that using spawn_blocking, and spawn_blocking has the exact same buffer challenges as io_uring does, so translating file IO to io_uring is not that tricky.
On the other hand, I don't think tokio::net's existing APIs will support io_uring. Or at least they won't support the buffer-based io_uring APIs; there is no reason they can't register for readiness through io_uring.
High-throughput network use cases that don't need/want AF_XDP or DPDK can get most of the speedup with sendmmsg/recvmmsg and segmentation offload.
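To make the batching idea concrete, here is a rough Linux-only sketch of receiving a batch of UDP datagrams with a single recvmmsg() call through the libc crate (the batch and buffer sizes are arbitrary):

    use std::mem;
    use std::net::UdpSocket;
    use std::os::unix::io::AsRawFd;

    const BATCH: usize = 16;  // arbitrary batch size
    const MTU: usize = 2048;  // arbitrary per-datagram buffer size

    // Receive up to BATCH datagrams with one recvmmsg() syscall; returns the
    // length of each datagram actually received.
    fn recv_batch(sock: &UdpSocket, bufs: &mut [[u8; MTU]; BATCH]) -> std::io::Result<Vec<usize>> {
        let mut iovecs: [libc::iovec; BATCH] = unsafe { mem::zeroed() };
        let mut msgs: [libc::mmsghdr; BATCH] = unsafe { mem::zeroed() };
        for i in 0..BATCH {
            iovecs[i].iov_base = bufs[i].as_mut_ptr() as *mut libc::c_void;
            iovecs[i].iov_len = MTU;
            msgs[i].msg_hdr.msg_iov = &mut iovecs[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }
        let n = unsafe {
            libc::recvmmsg(
                sock.as_raw_fd(),
                msgs.as_mut_ptr(),
                BATCH as libc::c_uint,
                0,                    // flags
                std::ptr::null_mut(), // no timeout
            )
        };
        if n < 0 {
            return Err(std::io::Error::last_os_error());
        }
        // msg_len is filled in by the kernel with the size of each datagram.
        Ok(msgs[..n as usize].iter().map(|m| m.msg_len as usize).collect())
    }

sendmmsg() is the mirror image on the transmit side, amortizing the per-packet syscall cost the same way.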
As an example, this library I wrote before is cancel-safe and doesn't use lifetimes etc. for it.
It is just a PITA to get it fully right.
You probably need the buffer to come from the async library, so the user allocates buffers through the async library, like a sibling comment says.
It is just much easier to not use Rust, say that futures always run to completion and can't simply be dropped, and make some actual progress. So I'm just doing it in Zig now.
Have your function signature be ‘async fn read(buffer: &mut Vec<u8>) -> Result<…>’ (you can use something more convenient like ‘&mut BytesMut’ too). If you run the future to completion (success or failure), the argument holds the same buffer passed in, with data filled in appropriately on success. If you cancel/drop the future, the buffer may point at an empty allocation instead (this is usually not an annoying constraint for most IO flows, and footgun potential is low).
The way this works is that your library “takes” the underlying allocation out of the variable before starting the operation, replacing it with a default, unallocated ‘Vec<u8>’. Once the buffer is no longer used by the IO system, the library puts it back before returning. If you cancel, it manages the buffer in the background and releases it when safe, and the unallocated buffer is left in the passed variable.
Or maybe I've misunderstood?
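For what it's worth, here is a minimal sketch of the take-and-restore flow described above; the owned-buffer primitive is a hypothetical stand-in for whatever the io_uring library actually exposes:

    use std::mem;

    // Hypothetical owned-buffer primitive: takes ownership of the buffer and
    // returns it (filled) on completion. A real io_uring-backed library would
    // keep the buffer alive until the kernel is done with it, even if the
    // caller's future is dropped.
    async fn submit_read_owned(mut buf: Vec<u8>) -> std::io::Result<Vec<u8>> {
        buf.extend_from_slice(b"pretend the kernel wrote this");
        Ok(buf)
    }

    // The `&mut Vec<u8>` API: take the allocation out of the caller's variable,
    // run the owned-buffer operation, then put the buffer back on completion.
    // If this future is dropped mid-flight, the caller is left holding an empty
    // Vec rather than a buffer the kernel might still be writing into.
    async fn read(buffer: &mut Vec<u8>) -> std::io::Result<usize> {
        let owned = mem::take(buffer);   // caller now holds an empty Vec
        let filled = submit_read_owned(owned).await?;
        let n = filled.len();
        *buffer = filled;                // hand the allocation back
        Ok(n)
    }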
Your write-up connected some early knowledge from when I was 11, when I was trying to set up a database/backend and kept finding lots of cgi-bin online. I realize now those were spinning up new processes with each request https://en.wikipedia.org/wiki/Common_Gateway_Interface
I remember when sendfile became available for my large gaming forum with dozens of TB of demo downloads. That alone was huge for concurrency.
I thought I had sworn off this type of engineering, but between this, the Netflix case of the extra 40ms, and the GTA 5 70% load-time reduction, maybe there is a lot more impactful work to be done.
https://netflixtechblog.com/life-of-a-netflix-partner-engine...
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...
> every HTTP session was commonly a forked copy of the entire server in the CERN and Apache lineage!
And there's nothing wrong with that for application workers. On *nix systems fork() is very fast; you can fork "the entire server" and the kernel will only COW your memory. As nginx etc. showed, you can get better raw file-serving performance with other models, but it's still a legitimate technique for application logic, where business logic will drown out any process overhead. The same model is possible in Apache httpd 2.x with the "prefork" MPM.
Depends on the workload.
Normally you would go read() -> write() so:
1. Disk -> page cache (DMA)
2. Kernel -> user copy (read)
3. User -> kernel copy (write)
4. Kernel -> NIC (DMA)
sendfile():
1. Disk -> page cache (DMA)
2. Kernel -> NIC (DMA)
No user-space copies; the kernel wires those pages straight to the socket.
So basically, it eliminates 1-2 memory copies, along with the associated cache pollution and memory-bandwidth overhead. If you are running high-QPS web services where syscall and copy overheads dominate, for example CDNs or static file serving, the gains can be really big. Based on my observations this can mean double-digit reductions in CPU usage and up to ~2x higher throughput.
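A rough Linux-only sketch of that path using libc::sendfile (error handling trimmed, and the file length is assumed to be known up front):

    use std::fs::File;
    use std::net::TcpStream;
    use std::os::unix::io::AsRawFd;

    // Stream `len` bytes of `file` to `sock` without ever copying the data
    // into user space: the kernel moves pages from the page cache straight
    // to the socket.
    fn send_file(sock: &TcpStream, file: &File, len: usize) -> std::io::Result<()> {
        let mut offset: libc::off_t = 0;
        while (offset as usize) < len {
            let remaining = len - offset as usize;
            // sendfile() advances `offset` by however many bytes it queued.
            let n = unsafe {
                libc::sendfile(sock.as_raw_fd(), file.as_raw_fd(), &mut offset, remaining)
            };
            if n < 0 {
                return Err(std::io::Error::last_os_error());
            }
            if n == 0 {
                break; // EOF reached earlier than expected
            }
        }
        Ok(())
    }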
sendfile effectively turns your user-space file server into a control plane and moves the data plane to where the data is, eliminating copies between address spaces. This can be made congruent with I/O completions (i.e. Ethernet+IP and block) and made asynchronous, so the entire thing is pumping data between completion events. Watch the Netflix video the author links in the post.
There is an inverted approach where you move all of this into a single user address space, i.e. DPDK, but it's the same overall concept, just a different "who".
I am patient enough to wait for the benchmarks, so take your time, but I honestly love how the author doesn't care about benchmarks right now and wanted to clean up the code first. It's kind of impressive that there are people with this line of thinking in a world where benchmarks get maxed out and a whole project's sole reason to exist is to satisfy benchmarks.
Really a breath of fresh air, and honestly I admire the author so much for this. It was such a good read, loved it a lot, thank you. I didn't know kTLS existed or that io_uring could be used in such a way.
I can recommend writing even the BPF side of things in Rust using Aya[1].
On FreeBSD, it's been in the kernel / OpenSSL since 13, and has been one runtime toggle (sysctl kern.ipc.tls.enable=1) away from being enabled. And it's enabled by default in the upcoming FreeBSD 15.
We (at Netflix) have run all of our tls encrypted streaming over kTLS for most of a decade.
Rust: you need to understand Futures, Pin, Waker, async runtimes, Send/Sync bounds, async trait objects, etc.
C++20: coroutines.
Go: goroutines.
Java 21+: virtual threads.
In any event it's essentially a stack frame, so it's not a failure of zero-overhead; the stack frame will need to live somewhere.
Go: goroutines are not async. And you can't understand goroutines without understanding channels. And channels are weirdly implemented in Go, where the semantics of edge cases, while well defined, are like rolling a D20 die if you try to reason from first principles.
Go doesn't force you to understand things. I agree with that. It has pros and cons.
I see what you mean but "cheap threads" is not the same thing as async. More like "current status of massive concurrency". Except that's not right either. tarweb, the subject of the blog post in question, is single threaded and uses io_uring as an event loop. (the idea being to spin up one thread per CPU core, to use full capacity)
So it's current status of… what exactly?
Cheap threads have a benefit over an async loop, the main one being that they're easier to reason about. They also have drawbacks: e.g. each thread may be lightweight, but it does need a stack.
Sure they are. The abstraction they provide is a synchronous API, but it's accomplished using an async runtime.
So reimplementing my foundation (with all the bugs) will not be worth it.
I will, however, compare Java's NIO (epoll) with the new Virtual Threads IO (without pinning).
https://github.com/axboe/liburing/wiki/io_uring-and-networki...
Also, there is NAPI support in io_uring, which uses polled IO on sockets instead of interrupt-based IO, from what I understand. You can see examples using it in the liburing GitHub repo.
In my experience “oversubscribing” threads to cores (more threads than cores) provides a wall-clock time benefit.
I think one thread per core would work better without preemptive scheduling.
But then we aren’t talking about Unix.
This works fine on Linux, and it's a common approach for trading systems, where it's fine to oversubscribe a bunch of cores for this type of stuff. The cores are mostly busy spinning and doing nothing, so it's very inefficient in terms of actual work, but great for latency and throughput when you need it.
It's not blanket good advice for all things.
Most developers are unfamiliar with the design idioms for thread-per-core, e.g. how to properly balance and shed load between cores.
In this very specific case, it seems as though the vast majority of the webserver's work is asynchronous and event-based, so the actual webserver is never waiting on I/O input or output - once it's ready you dump it somewhere the kernel can get to it and move on to the next request if there is one.
I think this gets this specific project close to the platonic ideal of a one-thread-per-core workload if indeed you're never waiting on I/O or any syscalls, but I feel as though it should come with extreme caveats of "this is almost never how the real world works so don't go artificially limiting your application to `nproc` threads without actually testing real-world use cases first".
io_uring is very cool tech though and has been progressing at an impressive pace the last few years.
It seems like there’s these fundamental things in OSes that we just can’t improve, or I suppose can’t without breaking too much backward compatibility, so we are forced to do this.
There's a software equivalent of the Peter Principle where software or an API becomes increasingly complex to the point where no one understands it. They then attempt to fix that by adding more functionality (complexity).
I am working on something like this for work. But with plain old C
> In order to avoid busy looping, both the kernel and the web server will only busy-loop checking the queue for a little bit (configurable, but think milliseconds), and if there’s nothing new, the web server will do a syscall to “go to sleep” until something gets added to the queue.
Without load, the overhead of calling (effectively) sleep() is, while technically there, not relevant.
But sure, you can tweak the busyloop timers and burn 100% CPU on kernel and user side indefinitely if you want to avoid that sleep-when-idle syscall. It's just… not a good idea.
First, there are some tricks required to actually make it work at all; then there is the problem that you'll need a core not only in userland but also inside the kernel, both of them per application.
Sharing a kernel spinning thread across multiple applications is also possible but requires further efforts (you need to share some parent ring across processes, which need to be related).
Overall I feel that it doesn't really deliver on the no-system-call idea, certainly not out of the box. You might have a more straightforward experience with XDP, which coincidentally gives you a lot more access and control as well if you need it.
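For reference, the kernel-side busy-poll window being described here is the SQPOLL idle timeout; with the io-uring crate (assuming its builder API, and with arbitrary numbers) the setup looks roughly like this:

    use io_uring::IoUring;

    fn main() -> std::io::Result<()> {
        // IORING_SETUP_SQPOLL: the kernel spawns a thread that busy-polls the
        // submission queue, so submitting SQEs needs no syscall while it's awake.
        let ring = IoUring::builder()
            .setup_sqpoll(2000)   // sq_thread_idle: spin for ~2s of idleness, then sleep
            .setup_sqpoll_cpu(3)  // optionally pin the kernel poller to a core
            .build(256)?;         // 256 submission-queue entries

        // Once the poller has gone idle, the next submit() issues one
        // io_uring_enter() to wake it back up; completions are reaped from
        // the CQ ring in user space without a syscall either way.
        let _ = ring;
        Ok(())
    }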
> This means that a busy web server can serve all of its queries without even once (after setup is done) needing to do a syscall. As long as queues keep getting added to, strace will show nothing.
For comparison, a read/write over a TCP socket on loopback between two processes takes a few microseconds using the BSD sockets API.
No? What they're saying is the busy loop will spin until an event occurs, for at most x ms. And if it does park the thread (the only syscall required), it can be immediately woken up on the first event too. Only if multiple events occurred since the last call would you receive them together. This normally happens only under high load, when event processing takes enough time to have a buildup of new events in the background. Increased latency is the intended outcome on high loads.
To be fair, it was a while ago I read the io-uring paper. But I distinctly recall the mix of poll and park behavior, plus configurable wait conditions. Please correct me if I'm wrong (someone here certainly knows).
What a time to be alive, when seconds to recompile is considered horrible devex.