Also, if you use fallocate(2) instead of posix_fallocate(3), you don't have to worry about glibc trying to emulate fallocate() for those file systems which don't use it.
Finally, it's a little surprising the author didn't try using O_DIRECT writes.
There are cases where we are 50% faster than O_DIRECT without any "caching". Furthermore, in high-bandwidth applications (>4GB/sec) without O_DIRECT it's easy to become CPU-limited in the blk/midlayer, so again we win.
Now that said, I haven't tried the latest blk-mq, scsi-mq, etc. patches, which are tuned for higher IOPS rates. These patches were driven by people plugging in high-performance flash arrays and discovering huge performance issues in the kernel. Still, I expect if you plug in a couple of high-end flash arrays the kernel is going to be the limit rather than the IO subsystem on a modern Xeon.
Also if the data is being copied into userspace anyway, then it's quite fast to check that memory is zero. There's no C "primitive" for this, but all C compilers can turn a simple loop into relatively efficient assembler[1].
If you're using an API that never copies the data into userspace and you have to read from a pipe, then yes sparse detection will be much more expensive.
In either case it should save disk space for core files which are highly sparse.
This could be done quickly in the kernel, too. The RAID subsystem, which does pass the data through multiple transformations, demonstrates that in the throughput metrics it prints at boot.
I've frequently observed sustained 500MB/sec writes and reads on my cheap ($250) 250GB SSDs. One of my favorite instances was running out of RAM while assembling a gigapan in Hugin. I added a swap file on my SSD and continued; it ran overnight with nearly 500MB/sec reads and writes more or less continuously, and the job finished fine.
Nope, it's MB not Mb.
I would never do XFS benchmarks because in my experience if XFS is writing during a powerdown, it trashes the FS (maybe this was fixed in the past 6 years, but after it happened 3 times I haven't touched that FS again).
[0] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux....
of course that depends on the amount of RAM the system has, and how the kernel VM parameters are tuned (sysctl vm.dirty_*)
just add an fdatasync() call and you will take into account the time it takes to flush all the dirty pages to disk.
He does say:
> in a real program you’d have to do real error handling instead of assertions, of course
But somebody somewhere is reading this and thinking this is a "semantically correct pattern" (as it is introduced) and may just copy-paste it into their program. Especially when contrasting it with a "wrong way" I think it wouldn't hurt to include real error handling. And that means something that doesn't fall into an infinite loop when the disk fills up.
The point is to retry on EINTR and to abort completely in case of other IO failures.
  assert(errno == EINTR);
  continue;

is equivalent to

  if (errno == EINTR)
      continue;
  abort();
> But somebody somewhere is reading this and thinking this is a "semantically correct pattern" (as it is introduced) and may just copy-paste it into their program.

Even if they do, it likely will not actually do any harm; it'll just kill the program instead of gracefully handling the error.
Asserts are compiled out entirely when NDEBUG is defined, so using an assert in place of real error checking, or otherwise relying on its side effects, is consequently a huge wtf in C.
I dare say that would be their fault for blindly copying and pasting without taking the time to understand the context. (He even gives an explicit disclaimer!) Robust error handling would just be more noise to filter through for people actually reading the article, and I don't think it's the author's responsibility to childproof things for people who aren't.
The fact that I got a reply based on a misunderstanding of how asserts work tells me it's a point that needs to be made.
Async I/O avoids this. You can tell the I/O subsystem what you want to read next even while doing a write. The I/O is posted to the disk in modern systems, and the disk will begin seeking to the read site in parallel with informing the OS that the write has completed. Posting I/O even helps for SSDs to avoid the idle time on the SSD media between write done and read start.
write(out, buf, (r - w)) should be write(out, buf + w, r - w)