- sendfile() cannot be done with QUIC, since the QUIC stack runs in userspace. That means that data must be read into kernel memory, copied to the webserver's memory, then copied back into the kernel, then sent down to the NIC. Worse, if crypto is not offloaded, userspace also needs to encrypt the data.
- LSO/LRO are (mostly) not implemented in hardware for QUIC, meaning the NIC is handed individual 1500-byte packets, rather than a 64K buffer that it segments down to 1500 bytes itself.
- The crypto is designed to prevent MiTM attacks, which also makes NIC crypto offload a lot harder. I'm not currently aware of any mainstream NIC (eg, not an FPGA by a startup) that can do inline TLS offload for QUIC.
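To make the first bullet concrete, here's a toy Python sketch of the two data paths (the function names and the pass-through `encrypt` stand-in are invented for illustration; the TCP case leans on `os.sendfile`, which exposes the kernel's zero-copy path):

```python
import os


def tcp_style_send(path, sock):
    """TCP + sendfile(): the kernel moves file pages directly to the
    socket; the payload never enters this process's address space."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        sent = 0
        while sent < size:
            sent += os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
    return sent


def quic_style_send(path, sock, encrypt):
    """Userspace QUIC: read() copies data up into this process, we
    encrypt it ourselves, then send() copies it back into the kernel."""
    sent = 0
    with open(path, "rb") as f:
        while chunk := f.read(1350):      # roughly one QUIC packet's payload
            sock.sendall(encrypt(chunk))  # extra copies + userspace crypto
            sent += len(chunk)
    return sent
```

Every byte in the second path crosses the user/kernel boundary twice and gets touched by the CPU at least once more for crypto; in the first path the CPU may never touch it at all.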
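On the LSO/LRO bullet, a small sketch of what segmentation costs in software (illustrative numbers only; real QUIC payloads are somewhat smaller still because of QUIC's own headers and AEAD tags):

```python
MTU = 1500
UDP_IP_OVERHEAD = 28                 # 20-byte IPv4 header + 8-byte UDP header
MAX_PAYLOAD = MTU - UDP_IP_OVERHEAD  # 1472 bytes of payload per datagram


def segment(buf, max_payload=MAX_PAYLOAD):
    """What the NIC does for TCP with LSO/TSO, done in software for QUIC:
    slice a large buffer into MTU-sized datagrams."""
    return [buf[i:i + max_payload] for i in range(0, len(buf), max_payload)]


packets = segment(bytes(64 * 1024))
# A single 64K write becomes ~45 separate datagrams, i.e. ~45 trips
# through the socket layer instead of 1. (sendmmsg() and Linux's
# UDP_SEGMENT/GSO option narrow this gap, but don't close it.)
```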
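Part of why the crypto bullet bites harder than TLS record offload: QUIC encrypts each packet with a nonce derived from the packet number, and then applies header protection to the packet-number field itself with a second key (RFC 9001), so an offload engine has to track per-connection, per-packet state. A sketch of just the nonce derivation, per RFC 9001 §5.3:

```python
def packet_nonce(static_iv: bytes, packet_number: int) -> bytes:
    """RFC 9001 §5.3: the AEAD nonce is the connection's static IV XORed
    with the packet number, left-padded to the IV's length."""
    pn = packet_number.to_bytes(len(static_iv), "big")
    return bytes(iv_b ^ pn_b for iv_b, pn_b in zip(static_iv, pn))
```

Each packet thus needs a unique nonce, and the (protected) packet number needed to rebuild it on the receive side isn't visible to hardware that only understands TLS records.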
There is ongoing work by a lot of folks to make this better. But at least for now, on the server side, QUIC is roughly an order of magnitude less efficient than TCP.
I did some experiments last year for a talk I gave which approximated losing the optimizations above: https://people.freebsd.org/~gallatin/talks/euro2022.pdf For a video-CDN-type workload with static content, we'd go from being able to serve ~400Gb/s on a single-socket AMD "Rome" based EPYC (with plenty of CPU idle) to less than 100Gb/s per server with the CPU maxed out.
For workloads where the content is not static and already has to be touched in userspace, the comparison won't be as unfavorable.
Huh? Surely what you're doing in the accelerated path is just AES encryption/decryption with a parameterised key, which can't be much different from TLS?
Then there’s the userspace work of assembling and encrypting all these tiny packets individually, and of looking up the right data structures (connections, streams).
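A toy sketch of that per-datagram demultiplexing (the class and method names are invented for illustration):

```python
class ToyEndpoint:
    """Every incoming datagram costs at least one hash lookup to find its
    connection (by destination connection ID), then more per-frame work
    to find the right stream — work the kernel does once per TCP segment."""

    def __init__(self):
        self.connections = {}  # DCID -> {stream id -> buffered bytes}

    def add_connection(self, dcid: bytes):
        self.connections[dcid] = {}

    def deliver(self, dcid: bytes, stream_id: int, payload: bytes):
        streams = self.connections.get(dcid)   # lookup 1: the connection
        if streams is None:
            return False                       # unknown CID: drop (or handshake)
        # lookup 2: the stream within the connection
        streams[stream_id] = streams.get(stream_id, b"") + payload
        return True
```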
And there are challenges in load-balancing multiple QUIC connections or streams across CPU cores. If only one core dequeues UDP datagrams for all connections on an endpoint, they will all be bottlenecked by that core, whereas for TCP the kernel and drivers can already spread the work across multiple receive queues and threads. And while one can run multiple sockets and threads with port reuse, that poses other challenges if a packet for a given connection gets routed to the wrong thread due to connection migration. There are solutions for that too, e.g. in the form of sophisticated eBPF programs, but they require a lot of work and are hard to apply for regular users who just want to use QUIC as a library.
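The eBPF approach mentioned above typically steers datagrams by connection ID rather than by the address 4-tuple, so a migrating connection keeps landing on the same worker. A rough Python-shaped sketch of the idea (the worker count and the "first 8 bytes of the CID" layout are assumptions for the example):

```python
NUM_WORKERS = 4


def pick_worker_by_tuple(src_ip: int, src_port: int) -> int:
    """Kernel default for SO_REUSEPORT: hash the peer address. Breaks
    when the client's address changes (connection migration)."""
    return hash((src_ip, src_port)) % NUM_WORKERS


def pick_worker_by_cid(dcid: bytes) -> int:
    """What a REUSEPORT eBPF program can do instead: hash the destination
    connection ID, which survives an address change."""
    return int.from_bytes(dcid[:8], "big") % NUM_WORKERS
```

With CID-based steering, the same connection maps to the same worker no matter which network path the client is on; tuple-based steering offers no such guarantee.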