But if you know you're not going to have endianness problems, you can just skip that step entirely.
- mmap, io_uring, drivers, and other "zero-copy" implementations require careful consideration of byte order.
- Filesystems, databases, and network applications can be high-throughput and will certainly benefit from being zero-copy (with gains anywhere from +1% to +2000% in performance).
This is absolutely not "premature optimization." If you're a C/C++ engineer, you should know off the top of your head how many cycles syscalls & memcpys cost. (Spoiler: They're slow.) You should evaluate your performance requirements and decide if you need to eliminate that overhead. For certain applications, if you do not meet the performance requirements, you cannot ship.
People were understandably concerned that we had fucked up in the feasibility phase of the project. Lots of people get themselves in trouble this way, and if we didn't finish our work on time during maintenance windows, this was a 9-figure piece of hardware sitting idle while our app picked its nose crunching data.
But I was on my longest hot streak of accurate perf estimates in my career, and this one was not going to be my Icarus moment. It ended up being tweaks needed from the compiler writer and from Wind River (a DMA problem). I had to spend a lot of social capital on all of this, especially the Wind River conference call (which took ten minutes for them to come around to my suggested fix, which they shipped to us within a week, after months and months of begging for that call).
This article is a good demonstration of the performance improvements via mmap zero-copy: https://medium.com/@kaixin667689/zero-copy-principle-and-imp...
Netflix also relies on zero-copy via kTLS & zero-copy TLS to serve 400Gbps: https://papers.freebsd.org/2021/eurobsdcon/gallatin-netflix-...
However, the performance gap can get even larger! (The kernel is historically not great at this.) For NVME & packet processors, you can see an increase of 10,000%+ in performance easily via a zero-copy implementation. See: https://www.dpdk.org https://spdk.io
#include <cstdint>

uint32_t read_le_uint32(const uint8_t* p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

ends up as

read_le_uint32(unsigned char const*):
        mov     eax, dword ptr [rdi]
        ret

This works with Clang and gcc on x86_64 (but not with MSVC).

uint8_t *buf = ...; struct example_payload *payload = (struct example_payload *) buf;
That's why you need to byte-swap when you access the fields. This is absolutely not portable, I agree. I also agree that it is error-prone. However, it is the reality of a lot of performance-critical software.
If you are going zero-copy, you either need to give up on any kind of portability or delve deep into compiler flags to standardize struct layout.
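That buffer-cast approach can be sketched as follows. This is a minimal, hedged example: `example_payload` and its field layout are hypothetical, the endianness detection assumes the GCC/Clang `__BYTE_ORDER__` predefined macros, and the cast itself carries exactly the strict-aliasing and alignment caveats discussed above.

```cpp
#include <cstdint>

// Hypothetical wire format: fields are little-endian on the wire.
struct example_payload {
    uint32_t id;
    uint16_t flags;
};

// Swap only if the host is big-endian (GCC/Clang predefined macros;
// MSVC would need a different check).
static uint32_t le32_to_host(uint32_t v) {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return __builtin_bswap32(v);  // big-endian host: swap to host order
#else
    return v;                     // little-endian host: no swap needed
#endif
}

// Zero-copy access: reinterpret the buffer in place, swap on field access.
// NOT portable: this relies on struct layout, alignment, and aliasing
// behavior that the standard does not guarantee.
uint32_t payload_id(const uint8_t* buf) {
    const example_payload* p = reinterpret_cast<const example_payload*>(buf);
    return le32_to_host(p->id);
}
```

The swap-on-access helper is what keeps this correct on both byte orders; the non-portable part is the cast, which is the trade-off the comment above is describing.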
if it's little endian (on the wire), the process would be like:
(value[0] | (value[1] << 8) | (value[2] << 16) | (value[3] << 24))
and in big endian (again, on the wire; architecture endianness is irrelevant) it would be the same thing with the indices reversed, where "value" is the 4 bytes read in off the wire?

Uh... Compared to doing nothing, yes, it's "slow."
"htonl, htons, ntohl, ntohs - convert values between host and network byte order"
The cheapest big-endian modern device is a Raspberry Pi running a NetBSD "eb" release, for those who want to test their code.
He even has an example where he just pushes the problem off to someone else: "if the people at Adobe wrote proper code to encode and decode their files". Yeah, hope they weren't ignoring byte order issues.
The key insight is that people shouldn't try to optimize for the case where the data stream's byte order happens to match the machine's byte order. That's both premature optimization and a recipe for bugs. Just don't worry about that case.
Load binary data one byte at a time and use shifts and ORs to compose the larger unit based on the data's byte order. That's 100% portable without any #ifdefs for the machine's byte order.
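That byte-at-a-time technique can be sketched as a pair of load/store helpers. These are driven entirely by the *data's* byte order, so the same source compiles and behaves identically on any host, with no #ifdefs (the function names are illustrative):

```cpp
#include <cstdint>

// Portable little-endian load: correct on any host, no #ifdefs.
uint32_t load_le32(const uint8_t* p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// Portable little-endian store: the mirror image.
void store_le32(uint8_t* p, uint32_t v) {
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

// The big-endian variant just reverses the byte positions.
uint32_t load_be32(const uint8_t* p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

Modern compilers typically recognize these patterns and emit a single plain (or byte-swapped) load/store anyway, which is the article's point.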
Original thread w/104 comments:
And do not define any data format to be big endian anymore. Define it as little endian (do not leave it undefined) and everyone will be happy.
So it's not even all networking... and "network byte order" will mess you up.
Given a reader (files, network connections, and buffers can all be turned into readers), you can call readInt. It takes the type you want and the endianness of the encoding. It's easy to write, self-documents, and is highly efficient.
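A minimal C++ sketch of that readInt shape, assuming a std::istream plays the role of the reader (the names `readInt` and `Endian` here are illustrative, not any particular library's API):

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>   // std::istringstream, for exercising the sketch

enum class Endian { Little, Big };

// Read sizeof(T) bytes from the stream and assemble them according to the
// *encoding's* endianness, portably, with shifts and ORs.
template <typename T>
T readInt(std::istream& in, Endian endian) {
    uint8_t bytes[sizeof(T)];
    in.read(reinterpret_cast<char*>(bytes), sizeof(T));
    T value = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        // In little-endian data the first byte is least significant;
        // in big-endian data it is most significant.
        std::size_t shift = (endian == Endian::Little) ? i : sizeof(T) - 1 - i;
        value |= (T)bytes[i] << (8 * shift);
    }
    return value;
}
```

Usage reads like documentation: `readInt<uint32_t>(reader, Endian::Big)` says exactly what the wire format is, which is the self-documenting property described above.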
But if we're talking about a struct or an array, if you're byte-order aware you can do things like memcpy the whole thing around that you couldn't do by assembling it out of individual readInt calls.
As for reading structs, that's supported too: https://ziglang.org/documentation/master/std/#std.io.Reader....
readStructEndian will read the struct into memory, and perform the relevant byte swaps if the machine's endianness doesn't match the data format. No need to manually specify how a struct is supposed to perform the byte swap, that's all handled automatically (and efficiently) by comptime.
Rust, for example, has from_be_bytes(), from_le_bytes(), and from_ne_bytes() methods on the number primitives u16, i16, u32, and so on. They all take a byte array of the correct length and interpret it as big-, little-, or native-endian, converting it to the number.
The first two methods work fine on all architectures, and that's what this article is about.
The third method, however, is architecture-dependent and should not be used for network data, because its behavior would differ between platforms, which is exactly what you don't want. In fact, let me cite this part of the documentation. It's very polite but true.
> As the target platform’s native endianness is used, portable code likely wants to use from_be_bytes or from_le_bytes, as appropriate instead.
Two areas where I find it does matter: assembly language, where bytes are parsed, sorted, or transformed in some way by code that writes words; and binary file representations written on a little-endian machine and read by a big-endian machine.
As a non-SWE, whenever I see checkboxes to enable options that maximize compatibility, I often assume there’s an implicit trade-off, so if it isn’t checked by default, I don’t enable such things unless strictly necessary. I don’t have any solid reason for this, it’s just my intuition. After all, if there were no good reasons not to enable Mac compatibility, why wouldn’t it be the default?
Also, a lot of comments in this thread have nothing to do with the article and appear to be responses to some invisible strawman.
It's true as far as it goes, but (1) it leans very heavily on the compiler understanding what you're doing and "un-portabilifying" your code when the native byte order matches the file format and (2) it presumes you're working with pickled "file" formats you "stream" in via bytes and not e.g. on memory mapped regions (e.g. network packets!) that want naturally to be inspected/modified in place.
It's fine advice though for the 90% of use cases. The author is correct that people tend to tie themselves into knots needlessly over this stuff.