Also, if you use fallocate(2) instead of posix_fallocate(3), you don't have to worry about glibc trying to emulate fallocate() for those file systems which don't use it.
Finally, it's a little surprising the author didn't try using O_DIRECT writes.
There are cases where we are 50% faster than O_DIRECT without any "caching". Furthermore, in high-bandwidth applications (>4GB/sec) without O_DIRECT it's easy to become CPU-limited in the blk/midlayer, so again we win.
Now that said, I haven't tried the latest blk-mq, scsi-mq, etc. patches, which are tuned for higher IOPS rates. These patches were driven by people plugging in high-performance flash arrays and discovering huge performance issues in the kernel. Still, I expect if you plug in a couple of high-end flash arrays the kernel is going to be the limit rather than the IO subsystem on a modern Xeon.
Also if the data is being copied into userspace anyway, then it's quite fast to check that memory is zero. There's no C "primitive" for this, but all C compilers can turn a simple loop into relatively efficient assembler[1].
If you're using an API that never copies the data into userspace and you have to read from a pipe, then yes sparse detection will be much more expensive.
In either case it should save disk space for core files which are highly sparse.
This could be done quickly in the kernel, too. The RAID subsystem, which does pass the data through multiple transformations, demonstrates that in the throughput metrics it prints at boot.
I've frequently observed sustained 500MB/sec writes and reads on my cheap ($250) 250GB SSDs. One of my favorite instances was running out of RAM while assembling a gigapan in Hugin. I added a swap file on my SSD and continued; it ran overnight with nearly 500MB/sec reads and writes more or less continuously, and the job finished fine.
Nope, it's MB not Mb.
I would never do XFS benchmarks because in my experience if XFS is writing during a powerdown, it trashes the FS (maybe this was fixed in the past 6 years, but after it happened 3 times I haven't touched that FS again).
[0] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux....
of course that depends on the amount of RAM the system has, and how the kernel VM parameters are tuned (sysctl vm.dirty_*)
just add an fdatasync() call and you will take into account the time it takes to flush all the dirty pages to disk.
He does say:
> in a real program you’d have to do real error handling instead of assertions, of course
But somebody somewhere is reading this and thinking this is a "semantically correct pattern" (as it is introduced) and may just copy-paste it into their program. Especially when contrasting it with a "wrong way" I think it wouldn't hurt to include real error handling. And that means something that doesn't fall into an infinite loop when the disk fills up.
The point is to retry on EINTR and to abort completely in case of other IO failures.
  assert(errno == EINTR);
  continue;

is equivalent to

  if (errno == EINTR)
      continue;
  abort();
> But somebody somewhere is reading this and thinking this is a "semantically correct pattern" (as it is introduced) and may just copy-paste it into their program.

Even if they do, it likely will not actually do any harm; it'll just kill the program instead of gracefully handling the error.
Asserts are compiled out entirely when NDEBUG is defined, so using an assert in place of real error checking, or otherwise relying on its side effects, is consequently a huge wtf in C.
I dare say that would be their fault for blindly copying and pasting without taking the time to understand the context. (He even gives an explicit disclaimer!) Robust error handling would just be more noise to filter through for people actually reading the article, and I don't think it's the author's responsibility to childproof things for people who aren't.
The fact that I got a reply based on a misunderstanding of how asserts work tells me it's a point that needs to be made.
Async I/O avoids this. You can tell the I/O subsystem what you want to read next even while doing a write. The I/O is posted to the disk in modern systems, and the disk will begin seeking to the read site in parallel with informing the OS that the write has completed. Posting I/O even helps for SSDs to avoid the idle time on the SSD media between write done and read start.
write(out, buf, (r - w)) should be write(out, buf + w, r - w)