It also reminds me of attempts to define BUSE[0][1][2], which would have been a block device equivalent of FUSE. IIRC attempts to get BUSE into the Linux kernel have been blocked for performance reasons -- the FUSE protocol isn't well designed and is only barely acceptable for VFS.
If io_uring (+ careful use of zero-copy) has fixed the performance issues with userspace block devices, maybe it would be applicable to FUSE (or FUSE-v2)? I've tried using io_uring with the current FUSE protocol to reduce syscall overhead and it kinda works, but a protocol designed to operate in that mode from the beginning would be even better.
[0] https://github.com/acozzette/BUSE
[1] https://dspace.cuni.cz/bitstream/handle/20.500.11956/148791/...
Block devices operate on blocks of data identified by offset. Hard disks, CD-ROM drives, USB sticks, basically anything where it'd make sense to say "read (or write) these 1024 bytes at offset 0x10000".
You can in principle implement a block device-ish API in FUSE by disabling open/close and requiring all reads/writes to be at given offsets -- IIRC this is how the "fuseblk" mode added for ntfs-3g works -- but the protocol is too chatty to be fast enough for things people want block devices for.
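To make the "read these bytes at this offset" interface concrete, here's a toy sketch in Python using plain pread/pwrite on an ordinary file standing in for a device's backing store. This is just an illustration of offset-addressed I/O, not the fuseblk protocol itself; the block size and device size are made up:

```python
import os
import tempfile

BLOCK = 1024  # toy block size

# An ordinary temp file stands in for a block device's backing store.
with tempfile.TemporaryFile() as backing:
    fd = backing.fileno()
    os.ftruncate(fd, 1 << 20)  # a 1 MiB "device"

    # "write these 1024 bytes at offset 0x10000": no open/close chatter,
    # just offset-addressed I/O, which is all a block device understands.
    os.pwrite(fd, b"\xab" * BLOCK, 0x10000)

    # "read these 1024 bytes at offset 0x10000"
    data = os.pread(fd, BLOCK, 0x10000)
    assert data == b"\xab" * BLOCK
```

Every request is fully described by (offset, length, buffer), which is why a filesystem protocol full of opens, lookups, and per-file state is overkill for this job.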
I've also heard the kernel's block layer error handling doesn't interact well with the FUSE protocol, but I don't know the details there.
- <odd>.x.x: Linus went crazy, broke absolutely _everything_, and rewrote the kernel to be a microkernel using a special message-passing version of Visual Basic. (timeframe: "we expect that he will be released from the mental institution in a decade or two").
[*] https://lkml.org/lkml/2005/3/2/247
It's really interesting to see Linux getting more and more microkernel-like features throughout the years.
Basically, you can implement a virtual SAN for containers efficiently with this.
There's a reason emulators design their virtual devices to resemble real hardware (PCI, SCSI, USB) -- there's already going to be a bunch of code in the hypervisor to create fake hardware. It's also more practical to piggy-back on PCI (etc) when the spec needs to be implemented by competing vendors, since there's no kernel and no OS idioms involved. Not to mention various pre-kernel code such as EFI and bootloaders.
Conversely, userspace developers really do not want to be coding up a fake PCI device with registers and interrupts and so on just to get some bytes into the kernel. They want to invoke system calls (ioctl, mmap, io_uring) and let the OS handle the details.
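As a trivial contrast, the syscall-style interface looks something like this: a file descriptor, an mmap'd region, and ordinary reads. This is a generic sketch against a temp file, not any particular driver's API:

```python
import mmap
import tempfile

# The interface userspace developers actually want: a couple of syscalls
# and a shared mapping -- not emulated PCI registers and interrupts.
with tempfile.TemporaryFile() as f:
    f.truncate(4096)
    with mmap.mmap(f.fileno(), 4096) as mem:
        mem[0:5] = b"hello"   # write through the shared mapping
        f.seek(0)
        seen = f.read(5)      # read() observes the same bytes
assert seen == b"hello"
```

The kernel handles coherence between the mapping and read()/write() behind the scenes, which is exactly the kind of detail userspace doesn't want to reimplement as fake hardware.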
I understand the frustration of having the network driver crash, but couldn't it be run in a way that doesn't bring down the OS?
It seems to me Java would have a no-brainer case for a user-space networking option, since you're already in a VM!?
When I saturate my HTTP server the kernel takes 30% of the CPU just copying data for no good reason?!
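For what it's worth, Linux already has primitives to skip exactly that copy on the file-to-socket path, e.g. sendfile(2), which moves bytes between descriptors inside the kernel so the payload never round-trips through a userspace buffer. A minimal file-to-file sketch (Linux-specific; real servers would use a file-to-socket pair):

```python
import os
import tempfile

payload = b"x" * 65536

with tempfile.TemporaryFile() as src, tempfile.TemporaryFile() as dst:
    src.write(payload)
    src.flush()

    # The kernel copies from src to dst internally; userspace never
    # touches the payload after the initial write.
    sent = os.sendfile(dst.fileno(), src.fileno(), 0, len(payload))
    assert sent == len(payload)

    dst.seek(0)
    copied = dst.read()
assert copied == payload
```

io_uring's registered buffers and zero-copy send go further in the same direction, cutting both the copies and the per-request syscall.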
What is the whole idea, though? Serving storage to the kernel from userland is decades old and commonly done with both NFS and iSCSI. The fact that this particular implementation uses io_uring instead of a cross-vendor standard like RDMA is just an implementation detail.