fwiw, IOCP in NT predates the similar mechanisms in Linux by at least a decade (and the VMS QIO scheme on which it was in turn based is even older). As I understand it, the reason Unix(1) (and then Linux) did not have efficient network I/O kernel interfaces until relatively recently was fear of patent litigation from MS.
(1) except for AIX, possibly due to IBM being less concerned about MS patents in this area.
Considering that Windows NT's IOCP is very close to a direct copy of the VMS QIO mechanism (and even closer underneath, in the officially undocumented boundary layer between user space and kernel space), I don't think it's a case of patents.
UNIX was simply against asynchronous I/O from the start: asynchronous I/O schemes were known and in use as early as the original UNIX, and going with a fully synchronous model was an explicit design choice at Bell Labs.
When asynchronous I/O turned out to be important enough to include after all, there was no common interface for it, and everyone fell back on select() and poll() for the most obvious AIO use case (networking), for lack of anything better. Meanwhile, properly implementing asynchronous I/O can be non-trivial - QIO never ran truly multithreaded from the client program's point of view, for example (NT focused on making async work from the start).
https://techmonitor.ai/technology/dec_forced_microsoft_into_...
Microsoft hired the main architect of VMS, Dave Cutler, away from DEC to design Windows NT - so this shouldn’t be a surprise.
https://stackoverflow.com/questions/30688028/un-associate-so...
Edit: also, are you guys hiring? ;)
Something like a disk elevator at a higher level of the stack?
I asked DJ (not on HN, but hangs out in our community Slack [3] where you can ask further if curious), who knows the disk side of things best, and he responds:
The OS is free to reorder writes (this is true for both io_uring and conventional IO).
In practice it does this for spinning disks, but not SSDs.
The OS is aware of the "geometry" of a spinning disk, i.e. what sectors are physically close to each other.
But for NVMe SSDs it is typically handled in the firmware. SSDs internally remap "logical" addresses (i.e. the address from the OS point of view) to "physical" addresses (actual locations on the SSD).
e.g. if the application (or OS) writes to block address "1" then "2", the SSD does not necessarily store these in adjacent physical locations. (OSTEP explains this well [0].)
"Performance Analysis of NVMe SSDs and their Implication on Real World Databases" [1] explains in more detail:
> In the conventional SATA I/O path, an I/O request arriving at the block layer will first be inserted into a request queue (Elevator). The Elevator would then reorder and combine multiple requests into sequential requests. While reordering was needed in HDDs because of their slow random access characteristics, it became redundant in SSDs where random access latencies are almost the same as sequential. Indeed, the most commonly used Elevator scheduler for SSDs is the noop scheduler (Rice 2013), which implements a simple First-In-First-Out (FIFO) policy without any reordering.
Applications can help performance by grouping writes according to time-of-death (per "The Unwritten Contract of Solid State Drives" [2]), but the SSD is free to do whatever. We are shortly going to be reworking the LSM's compaction scheduling to take advantage of this: https://github.com/tigerbeetledb/tigerbeetle/issues/269.
[0] https://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf
[1] https://www.cs.binghamton.edu/~tameesh/pubs/systor2015.pdf
[2] https://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf
[3] https://join.slack.com/t/tigerbeetle/shared_invite/zt-1gf3qn...
One of the reasons is that libdispatch's I/O functions introduce extra dynamic allocations, both for internal queueing via `dispatch_async` ([0],[1],[2]) and, at the API level, for realloc-ing [3] an internally owned [4] buffer.
TigerBeetle, on the other hand, statically allocates all I/O buffers upfront [5], treats these buffers as intrusively-provided typed data [6] (no growing/owned buffers), and does internal queueing without synchronization or dynamic allocation [7].
[0]: https://github.com/apple/swift-corelibs-libdispatch/blob/469...
[1]: https://github.com/apple/swift-corelibs-libdispatch/blob/469...
[2]: https://github.com/apple/swift-corelibs-libdispatch/blob/469...
[3]: https://github.com/apple/swift-corelibs-libdispatch/blob/469...
[4]: https://developer.apple.com/documentation/dispatch/1388933-d...
[5]: https://tigerbeetle.com/blog/a-database-without-dynamic-memo...
[6]: https://github.com/tigerbeetledb/tigerbeetle/blob/d15acc663f...
[7]: https://github.com/tigerbeetledb/tigerbeetle/d15acc663f8882c...
I could probably change it to use io_uring and kqueue on those platforms, but I wanted to make a POSIX-compatible version first.