The idea is a syscall for getting a ringbuffer for any supported file descriptor, including pipes - and for pipes, if both ends support using the ringbuffer they'll map the same ringbuffer: zero copy IO, potentially without calling into the kernel at all.
Would love to find collaborators for this one :)
Is there a planned, standardized way to signal to the other end of the pipe that ring buffers are supported, so this could be handled transparently in libc? If not, I don't really see what advantage it gets you over shared memory + a futex for synchronization (for pipes, that is).
It is different from a pipe - instead of using read/write to copy data from/to a kernel buffer, it gives user space a mapped buffer object, which they need to take care to use properly (using atomic operations on the head/tail and such).
If you own the code for the reader and writer, it's like using shared memory for a buffer. The proposal is about standardizing an interface.
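To make the head/tail mechanics concrete, here is a minimal single-producer/single-consumer ring in C. This is just an illustrative sketch of the userspace side, not the proposed kernel interface (which doesn't exist yet); in practice the struct would live in memory mapped into both processes, e.g. via memfd/mmap.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define RING_SIZE 1024  /* must be a power of two */

/* Illustrative SPSC ring: indices grow monotonically and are reduced
 * modulo RING_SIZE only when indexing into the data array. */
struct ring {
    _Atomic size_t head;            /* next slot the consumer reads */
    _Atomic size_t tail;            /* next slot the producer writes */
    unsigned char data[RING_SIZE];
};

static bool ring_push(struct ring *r, unsigned char byte)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE)
        return false;                          /* full */
    r->data[tail & (RING_SIZE - 1)] = byte;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

static bool ring_pop(struct ring *r, unsigned char *byte)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                          /* empty */
    *byte = r->data[head & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

int main(void)
{
    static struct ring r;
    unsigned char b;
    ring_push(&r, 'x');
    if (ring_pop(&r, &b))
        printf("%c\n", b);                     /* prints: x */
    return 0;
}
```

Note there is no syscall anywhere on the push/pop path; the kernel would only be needed to set up the mapping and to sleep/wake when the ring is empty or full (e.g. via a futex).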
This is caused by the CONFIG_RETHUNK option. In the disassembly from objdump you are seeing the result of RET being replaced with JMP __x86_return_thunk.
https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...
https://github.com/torvalds/linux/blob/v6.1/arch/x86/lib/ret...
> The NOP instructions at the beginning and at the end of the function allow ftrace to insert tracing instructions when needed.
These are from the ASM_CLAC and ASM_STAC macros, which make space for the CLAC and STAC instructions (both of them three bytes in length, same as the number of NOPs) to be filled in at runtime if X86_FEATURE_SMAP is detected.
https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...
https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...
https://github.com/torvalds/linux/blob/v6.1/arch/x86/kernel/...
1. would know the above
2. would choose such an obnoxious throwaway handle
Because of that, it is economical to spend lots of time optimizing it, even if it only makes the code marginally more efficient.
Pipes aren't used everywhere in production in hot paths. That just doesn't happen.
If 100 million people each save 1 cent because of your work, you've saved $1 million in total, but in practice nobody is observably better off.
I've used pipes for a lot of stuff over 10+ years and never noticed being limited by the speed of the pipe; I'm almost certainly limited by tar, gzip, find, grep, nc, ... (even though these also tend to be pretty fast for what they do).
1. Logging. At first our tools for reading the logs from a filesystem management program were using pipes, but they would be overwhelmed quickly (even before they would overwhelm the pagers and other tools further down the line). We had to write our own pager and give up on using pipes.
2. Storage again, but a different problem: we had a setup where we deployed SPDK to manage the iSCSI frontend duties, and our own component to manage the actual storage process. It was very important that the communication between these two components be as fast and as memory-efficient as possible. The slowness of pipes also comes from the fact that they have to copy memory. We had to extend SPDK to make it communicate with our component through shared memory instead.
So, yeah, pipes are unlikely to be the bottleneck for many applications, but definitely not for all of them.
Let's not get carried away. You can use ffmpeg as a library and encode buffers in a few dozen lines of C++.
It's clumsier, to be sure, but if performance is your goal, the socket should be faster.
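For reference, the library route looks roughly like this with libavcodec. This is a sketch with most error handling elided; it assumes an FFmpeg build that includes an H.264 encoder (e.g. libx264), and the frame-filling and output steps are left as placeholders.

```c
#include <libavcodec/avcodec.h>

int encode_frames(int width, int height, int nframes)
{
    /* Assumes the build has an H.264 encoder available. */
    const AVCodec *codec = avcodec_find_encoder(AV_CODEC_ID_H264);
    if (!codec) return -1;

    AVCodecContext *ctx = avcodec_alloc_context3(codec);
    ctx->width     = width;
    ctx->height    = height;
    ctx->time_base = (AVRational){1, 25};
    ctx->framerate = (AVRational){25, 1};
    ctx->pix_fmt   = AV_PIX_FMT_YUV420P;
    if (avcodec_open2(ctx, codec, NULL) < 0) return -1;

    AVFrame *frame = av_frame_alloc();
    frame->format = ctx->pix_fmt;
    frame->width  = width;
    frame->height = height;
    av_frame_get_buffer(frame, 0);

    AVPacket *pkt = av_packet_alloc();
    for (int i = 0; i < nframes; i++) {
        av_frame_make_writable(frame);
        /* ...fill frame->data[0..2] with the Y/U/V planes here... */
        frame->pts = i;
        avcodec_send_frame(ctx, frame);
        while (avcodec_receive_packet(ctx, pkt) == 0) {
            /* ...hand pkt->data / pkt->size to a muxer or file... */
            av_packet_unref(pkt);
        }
    }
    avcodec_send_frame(ctx, NULL);            /* enter flush mode */
    while (avcodec_receive_packet(ctx, pkt) == 0)
        av_packet_unref(pkt);                 /* drain remaining packets */

    av_packet_free(&pkt);
    av_frame_free(&frame);
    avcodec_free_context(&ctx);
    return 0;
}
```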
Donald Knuth thinks the same: https://en.wikipedia.org/wiki/Program_optimization#When_to_o...
https://www.toyota.com/grcorolla/
(These machines have amazing engineering and performance, and their entire existence is a hack to work around rules making it unviable to bring the intended GR Yaris to the US market. Maybe just enough eng/perf/hack/market relevance to HN folk to warrant my lighthearted reply. Also, the company president is still on the tools.)
Suppose you're iterating over the lines of stdout and need to use sed, cut, and so on: using pipes will slow things down considerably (and the startup time of sed and cut will make things worse).
Using bash/zsh string interpolation would be much faster.
Also, why leave performance on the table by default? Just because “it should be enough for most people I can think of”?
Add Tesla motors to a Toyota Corolla and now you’ve got a sportier car by default.
It's not optimizing the footprint or speed of the application; it's optimizing the resources and speed of development and deployment.
All thresholds are described in https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch...
And they are not final: Noah Goldstein still updates them every year.
https://github.com/llvm/llvm-project/blob/main/libc/src/stri...
On a Zen 3 CPU, "rep movsb" becomes faster than, or the same as, anything else at lengths slightly greater than 2 kB.
However, there is a range of multi-megabyte lengths, corresponding roughly to sizes that fit within the L3 cache but exceed the L2 cache, where for some weird reason "rep movsb" becomes slower than SIMD non-temporal stores.
At lengths exceeding the L3 size, "rep movsb" again becomes the fastest copy method.
Intel CPUs behave differently.
On my Zen 3 CPU, for lengths of 2 kB or smaller it is possible to copy faster than with "rep movsb", but only by using SIMD instructions (or, equivalently, the builtin memcpy provided by most C compilers), not with a C loop (unless the compiler recognizes the C loop and replaces it with the builtin memcpy, which some compilers will do at high optimization levels).
[1] https://www.intel.com/content/dam/www/central-libraries/us/e...
[2] https://www.intel.com/content/www/us/en/developer/articles/t...
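If you want to reproduce these measurements, a "rep movsb" copy is only a few lines of inline asm (GCC/Clang, x86-64). A sketch to benchmark against your libc's memcpy across sizes:

```c
#include <stddef.h>

/* Copy n bytes with "rep movsb". The "+D"/"+S"/"+c" constraints pin
 * dst, src, and n to RDI, RSI, and RCX, which is what the instruction
 * expects; the "memory" clobber keeps the compiler from reordering
 * loads/stores around the copy. */
static void *rep_movsb_copy(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}
```

Timing this against memcpy for sizes from a few hundred bytes up past the L3 size should show the crossover points described above.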
Specifically, there aren't many reasons for your fastest IPC to be slower than a long function call.
Saying "long function call" doesn't mean much since a function can take infinitely long.
A possible answer that's currently just below your comment: https://news.ycombinator.com/item?id=41351870
> vmsplice doesn't work with every type of file descriptor.
Looks like an amazing article, with so much to learn about what happens under the hood.
I don't mean this as a slight to anyone; I just want to point out that the HN "hug of death" can be trivially handled by a single cheap VPS without even breaking a sweat.
Anyway, nice article; it's good to know what's going on under the hood.
https://cygwin.com/pipermail/cygwin-patches/2016q1/008301.ht...
But still, kudos to the Cygwin developers for creating Cygwin :) Great work, even though it has some issues.
In my experience in data engineering, it's very unlikely you can exceed 500 MB/s of throughput in your business logic, as most of the libraries you're using are not optimized to that degree (SIMD etc.). That being said, I think it's a good technique to try out.
I’m trying to think of other applications this could be useful for. Maybe video workflows?
The jump seems to be generated by the expansion of the `ASM_CLAC` macro, which is supposed to change the EFLAGS register ([1], [2]). However, in this case the expansion looks like it does nothing (maybe because of the target?). I'd be interested to know more about that. Call to the wild.
[1]: https://github.com/torvalds/linux/blob/master/arch/x86/inclu...
https://jvns.ca/blog/2017/03/19/getting-started-with-ftrace/
I think you need to recompile your compiler, or disable those explicitly via link/cc flags. It's fairly hard to coax compilers into emitting SIMD instructions, or to dissuade them from it, IMHO.
For the data transfer rate it doesn't matter how (in which language) the pipe is established; C, Rust, and the like will have a (small) edge in start-up time (latency), though.
https://linux.die.net/man/1/pv
It is in the pipe command: `... | pv > /dev/null`
% pv </dev/zero >/dev/null
54.0GiB/s
% pv </dev/zero --discard
58.7GiB/s

The only time I've used them is due to external constraints. They are just not useful.
vmsplice doesn't work with every type of file descriptor. Eschewing some technology entirely because it seems archaic or because it makes writing "the fastest X software" seem harder is just sloppy engineering.
> they are just not useful.
Then you have not written enough software yet to discover how they are useful.
Nothing ever touches those pages on the consumer side, so they can be reused immediately.
If you actually want a functional program using vmsplice, with a real consumer, things get hairy very quickly.
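The happy path really is tiny, which is what makes the benchmark numbers so seductive. Here's a minimal producer-side sketch (Linux-only, error handling trimmed), with the hard coordination problem relegated to a comment:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) < 0) { perror("pipe"); return 1; }

    /* Gifted pages should be page-aligned, page-sized chunks. */
    static char buf[65536] __attribute__((aligned(4096)));

    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

    /* Gift the pages to the pipe: the kernel maps them instead of
     * copying. The catch: the producer must not touch buf again until
     * a real consumer has drained it; that is exactly where things
     * get hairy. */
    ssize_t n = vmsplice(fds[1], &iov, 1, SPLICE_F_GIFT);
    if (n < 0) { perror("vmsplice"); return 1; }
    printf("spliced %zd bytes\n", n);
    return 0;
}
```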
Sure, you could build that box with glue and clamps and ample time; sure, it would look neater and weigh less than the version that's currently holding you imprisoned, and if done right it would even be stronger. But it takes more time and effort, as well as those glue clamps and other specialised tools to create perfectly matching surfaces, while the builder just wielded that hammer and those nails and is now building yet another utilitarian piece of work with the same hammer and nails.
Sometimes all you need is a hammer and some nails. Or pipes.
It's incredibly valuable in day-to-day work.
If you dislike their (relative) slowness, it's open source, you can participate in making them faster.
And I'm sure that after this HN post we'll see some patches and merge requests.
Personally I think there's much worse ugliness in POSIX than pipes. For example, I've just spent the last couple of days debugging a number of bugs in a shell's job control code (`fg`, `bg`, `jobs`, etc).
But despite its warts, I'm still grateful we have something like POSIX to build against.
In fact, if you ever set O_NONBLOCK on a pipe, you need to be damn sure both the reader and the writer expect non-blocking I/O, because you'll get heisenbugs under heavy I/O when one outpaces the other and one of them expects blocking I/O. When's the last time you checked the error code of `printf` and put it in a retry loop?
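For comparison, a correct retry loop around a possibly non-blocking pipe fd looks something like the sketch below (hypothetical helper, plain write(2) plus poll(2) rather than stdio):

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Write all of buf to fd, which may or may not be O_NONBLOCK:
 * handle short writes, wait for writability on EAGAIN, and retry
 * on EINTR, instead of silently dropping output the way an
 * unchecked printf() can. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n > 0) {
            buf += n;
            len -= (size_t)n;
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct pollfd p = { .fd = fd, .events = POLLOUT };
            if (poll(&p, 1, -1) < 0 && errno != EINTR)
                return -1;
        } else if (n < 0 && errno == EINTR) {
            continue;
        } else {
            return -1;
        }
    }
    return 0;
}
```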
Not sure what printf has to do with it; it isn't designed to be used with a non-blocking writer (but that only concerns one side). How will the reader being non-blocking change the semantics of the writer? It doesn't.
You can't set O_NONBLOCK on a pipe fd you expect to use with stdio, but that isn't unique to pipes. Whether the reader is O_NONBLOCK will not affect you if you're pushing the writer with printf/stdio.
(This is also a reason why I balk a bit when people refer to O_NONBLOCK as "async IO"; it isn't the same thing, and it leads to exactly this confusion.)
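This is easy to demonstrate: the two ends of a pipe are separate open file descriptions, so setting O_NONBLOCK on the read end leaves the write end's blocking semantics untouched. A quick sketch:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) < 0) { perror("pipe"); return 1; }

    /* Make only the read end non-blocking. */
    int flags = fcntl(fds[0], F_GETFL);
    fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

    /* The write end keeps its own flags: each end is a distinct
     * open file description. */
    int rflags = fcntl(fds[0], F_GETFL);
    int wflags = fcntl(fds[1], F_GETFL);
    printf("reader O_NONBLOCK: %d, writer O_NONBLOCK: %d\n",
           !!(rflags & O_NONBLOCK), !!(wflags & O_NONBLOCK));
    /* prints: reader O_NONBLOCK: 1, writer O_NONBLOCK: 0 */
    return 0;
}
```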