fwiw, IOCP in NT predates the similar mechanisms in Linux by at least a decade (and the VMS QIO scheme on which it was in turn based is even older). As I understand it, the reason Unix(1) (and then Linux) did not have efficient network I/O kernel interfaces until relatively recently was fear of patent litigation from MS.
(1) except for AIX, possibly due to IBM being less concerned about MS patents in this area.
Considering that Windows NT's IOCP is very close to a direct copy of the VMS QIO mechanism (and even closer underneath, in the officially undocumented boundary layer between user space and kernel space), I don't think it's a case of patents.
UNIX was simply against asynchronous I/O from the start: asynchronous I/O schemes were known and in use as early as the original UNIX, and going with a fully synchronous model was an explicit design choice at Bell Labs.
When asynchronous I/O turned out to be important enough to include after all, there was no common interface for it, and everyone fell back on select() and poll() for the most obvious AIO use case (networking), for lack of anything better. Meanwhile, properly implementing asynchronous I/O can be non-trivial - QIO never ran truly multithreaded from the client program's point of view, for example (NT focused on making async work from the start).
https://techmonitor.ai/technology/dec_forced_microsoft_into_...
Microsoft hired the main architect of VMS, Dave Cutler, away from DEC to design Windows NT - so this shouldn’t be a surprise.
https://stackoverflow.com/questions/30688028/un-associate-so...
Edit: also, are you guys hiring? ;)
Something like a disk elevator at a higher level of the stack?
I asked DJ (not on HN, but hangs out in our community Slack [3] where you can ask further if curious), who knows the disk side of things best, and he responds:
The OS is free to reorder writes (this is true for both io_uring and conventional IO).
In practice it does this for spinning disks, but not SSDs.
The OS is aware of the "geometry" of a spinning disk, i.e. what sectors are physically close to each other.
But for NVMe SSDs it is typically handled in the firmware. SSDs internally remap "logical" addresses (i.e. the address from the OS point of view) to "physical" addresses (actual locations on the SSD).
e.g. if the application (or OS) writes to block address "1" then "2", the SSD does not necessarily store these in adjacent physical locations. (OSTEP explains this well [0].)
"Performance Analysis of NVMe SSDs and their Implication on Real World Databases" [1] explains in more detail:
> In the conventional SATA I/O path, an I/O request arriving at the block layer will first be inserted into a request queue (Elevator). The Elevator would then reorder and combine multiple requests into sequential requests. While reordering was needed in HDDs because of their slow random access characteristics, it became redundant in SSDs where random access latencies are almost the same as sequential. Indeed, the most commonly used Elevator scheduler for SSDs is the noop scheduler (Rice 2013), which implements a simple First-In-First-Out (FIFO) policy without any reordering.
Applications can help performance by grouping writes according to time-of-death (per "The Unwritten Contract of Solid State Drives" [2]), but the SSD is free to do whatever. We are shortly going to be reworking the LSM's compaction scheduling to take advantage of this: https://github.com/tigerbeetledb/tigerbeetle/issues/269.
[0] https://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf
[1] https://www.cs.binghamton.edu/~tameesh/pubs/systor2015.pdf
[2] https://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf
[3] https://join.slack.com/t/tigerbeetle/shared_invite/zt-1gf3qn...
One of the reasons is that libdispatch's I/O functions introduce extra dynamic allocations, both for internal queueing via `dispatch_async` ([0],[1],[2]) and, at the API level, for realloc-ing [3] an internally owned [4] buffer.
TigerBeetle, on the other hand, statically allocates all I/O buffers upfront [5], treats these buffers as intrusively-provided typed data [6] (no growing/owned buffers), and does internal queueing without synchronization or dynamic allocation [7].
[0]: https://github.com/apple/swift-corelibs-libdispatch/blob/469...
[1]: https://github.com/apple/swift-corelibs-libdispatch/blob/469...
[2]: https://github.com/apple/swift-corelibs-libdispatch/blob/469...
[3]: https://github.com/apple/swift-corelibs-libdispatch/blob/469...
[4]: https://developer.apple.com/documentation/dispatch/1388933-d...
[5]: https://tigerbeetle.com/blog/a-database-without-dynamic-memo...
[6]: https://github.com/tigerbeetledb/tigerbeetle/blob/d15acc663f...
[7]: https://github.com/tigerbeetledb/tigerbeetle/d15acc663f8882c...
I could probably change it to use io_uring and kqueue on those platforms, but I wanted to make a POSIX-compatible version first.