But if you know you're not going to have endianness problems, you can just skip that step entirely.
- mmap, io_uring, drivers, and other "zero-copy" implementations require careful consideration of byte order.
- Filesystems, databases, and network applications can be high-throughput and will certainly benefit from being zero-copy (with gains anywhere from +1% to +2000% in performance).
This is absolutely not "premature optimization." If you're a C/C++ engineer, you should know off the top of your head how many cycles syscalls & memcpys cost. (Spoiler: They're slow.) You should evaluate your performance requirements and decide if you need to eliminate that overhead. For certain applications, if you do not meet the performance requirements, you cannot ship.
People were understandably concerned that we had fucked up in the feasibility phase of the project. Lots of people get themselves in trouble this way, and if we didn't finish our work on time during maintenance windows, this was a 9-figure piece of hardware sitting idle while our app picked its nose crunching data.
But I was on my longest hot streak of accurate perf estimates in my career, and this one was not going to be my Icarus moment. It ended up being tweaks needed from the compiler writer and from Wind River (a DMA problem). I had to spend a lot of social capital on all of this, especially the Wind River conference call (which took ten minutes for them to come around to my suggested fix, which they shipped to us within a week, after months and months of begging for that call).
This article is a good demonstration of the performance improvements via mmap zero-copy: https://medium.com/@kaixin667689/zero-copy-principle-and-imp...
Netflix also relies on zero-copy via kTLS & zero-copy TLS to serve 400Gbps: https://papers.freebsd.org/2021/eurobsdcon/gallatin-netflix-...
However, the performance gap can get even larger! (The kernel is historically not great at this.) For NVME & packet processors, you can see an increase of 10,000%+ in performance easily via a zero-copy implementation. See: https://www.dpdk.org https://spdk.io
#include <cstdint>

uint32_t read_le_uint32(const uint8_t* p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

ends up as

read_le_uint32(unsigned char const*):
        mov     eax, dword ptr [rdi]
        ret

This works with Clang and gcc on x86_64 (but not with MSVC).

uint8_t *buf = ...; struct example_payload *payload = (struct example_payload *) buf;
That's why you need to byte-swap when you access the fields. This is absolutely not portable, I agree. I also agree that it is error-prone. However, it is the reality of a lot of performance-critical software.
If you are going zero-copy, you either need to give up on any kind of portability or delve deep into compiler flags to standardize struct layout.
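That buffer-cast approach can be sketched as follows. This is a minimal, hedged example: `example_payload` and its field layout are hypothetical, the endianness detection assumes the GCC/Clang `__BYTE_ORDER__` predefined macros, and the cast itself carries exactly the strict-aliasing and alignment caveats discussed above.

```cpp
#include <cstdint>

// Hypothetical wire format: fields are little-endian on the wire.
struct example_payload {
    uint32_t id;
    uint16_t flags;
};

// Swap only if the host is big-endian (GCC/Clang predefined macros;
// MSVC would need a different check).
static uint32_t le32_to_host(uint32_t v) {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return __builtin_bswap32(v);  // big-endian host: swap to host order
#else
    return v;                     // little-endian host: no swap needed
#endif
}

// Zero-copy access: reinterpret the buffer in place, swap on field access.
// NOT portable: this relies on struct layout, alignment, and aliasing
// behavior that the standard does not guarantee.
uint32_t payload_id(const uint8_t* buf) {
    const example_payload* p = reinterpret_cast<const example_payload*>(buf);
    return le32_to_host(p->id);
}
```

The swap-on-access helper is what keeps this correct on both byte orders; the non-portable part is the cast, which is the trade-off the comment above is describing.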
if it's little endian (on the wire), the process would be like:
(value[0] | (value[1] << 8) | (value[2] << 16) | (value[3] << 24))
and in big endian (again, on the wire; architecture endianness is irrelevant) it would be the same thing with the indices reversed, where "value" is the 4 bytes read in off the wire?

Uh... Compared to doing nothing, yes, it's "slow."
"htonl, htons, ntohl, ntohs - convert values between host and network byte order"
The cheapest big-endian modern device is a Raspberry Pi running a NetBSD "eb" release, for those who want to test their code.
He even has an example where he just pushes the problem off to someone else: "if the people at Adobe wrote proper code to encode and decode their files". Yeah, hope they weren't ignoring byte order issues.
The key insight is that people shouldn't try to optimize for the case where the data stream's byte order happens to match the machine's byte order. That's both premature optimization and a recipe for bugs. Just don't worry about that case.
Load binary data one byte at a time and use shifts and ORs to compose the larger unit based on the data's byte order. That's 100% portable without any #ifdefs for the machine's byte order.
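That byte-at-a-time technique can be sketched as a pair of load/store helpers. These are driven entirely by the *data's* byte order, so the same source compiles and behaves identically on any host, with no #ifdefs (the function names are illustrative):

```cpp
#include <cstdint>

// Portable little-endian load: correct on any host, no #ifdefs.
uint32_t load_le32(const uint8_t* p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// Portable little-endian store: the mirror image.
void store_le32(uint8_t* p, uint32_t v) {
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

// The big-endian variant just reverses the byte positions.
uint32_t load_be32(const uint8_t* p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

Modern compilers typically recognize these patterns and emit a single plain (or byte-swapped) load/store anyway, which is the article's point.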
Original thread w/104 comments:
And do not define any data format to be big endian anymore. Define it as little endian (do not leave it undefined) and everyone will be happy.
So it's not even all networking... and "network byte order" will mess you up.
Given a reader (files, network connections, and buffers can all be turned into readers), you can call readInt. It takes the type you want and the endianness of the encoding. It's easy to write, self-documents, and is highly efficient.
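A minimal C++ sketch of that readInt shape, assuming a std::istream plays the role of the reader (the names `readInt` and `Endian` here are illustrative, not any particular library's API):

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>   // std::istringstream, for exercising the sketch

enum class Endian { Little, Big };

// Read sizeof(T) bytes from the stream and assemble them according to the
// *encoding's* endianness, portably, with shifts and ORs.
template <typename T>
T readInt(std::istream& in, Endian endian) {
    uint8_t bytes[sizeof(T)];
    in.read(reinterpret_cast<char*>(bytes), sizeof(T));
    T value = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        // In little-endian data the first byte is least significant;
        // in big-endian data it is most significant.
        std::size_t shift = (endian == Endian::Little) ? i : sizeof(T) - 1 - i;
        value |= (T)bytes[i] << (8 * shift);
    }
    return value;
}
```

Usage reads like documentation: `readInt<uint32_t>(reader, Endian::Big)` says exactly what the wire format is, which is the self-documenting property described above.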
But if we're talking about a struct or an array, if you're byte-order aware you can do things like memcpy the whole thing around that you couldn't do by assembling it out of individual readInt calls.
As for reading structs, that's supported too: https://ziglang.org/documentation/master/std/#std.io.Reader....
readStructEndian will read the struct into memory, and perform the relevant byte swaps if the machine's endianness doesn't match the data format. No need to manually specify how a struct is supposed to perform the byte swap, that's all handled automatically (and efficiently) by comptime.
Rust, for example, has from_be_bytes(), from_le_bytes(), and from_ne_bytes() methods on the number primitives u16, i16, u32, and so on. They all take a byte array of the correct length and interpret it as big-, little-, or native-endian, converting it to the number.
The first two methods work fine on all architectures, and that's what this article is about.
The third method, however, is architecture-dependent and should not be used for network data, because its behavior would differ between platforms, which is exactly what you don't want. In fact, let me cite this part of the documentation. It's very polite but true.
> As the target platform’s native endianness is used, portable code likely wants to use from_be_bytes or from_le_bytes, as appropriate instead.
Two areas where I find it does matter: assembly language, where bytes are parsed, sorted, or transformed in some way by code that writes words; and binary file representations written on a little-endian machine and read by a big-endian machine.
As a non-SWE, whenever I see checkboxes to enable options that maximize compatibility, I often assume there’s an implicit trade-off, so if it isn’t checked by default, I don’t enable such things unless strictly necessary. I don’t have any solid reason for this, it’s just my intuition. After all, if there were no good reasons not to enable Mac compatibility, why wouldn’t it be the default?
Also, a lot of comments in this thread have nothing to do with the article and appear to be responses to some invisible strawman.
It's true as far as it goes, but (1) it leans very heavily on the compiler understanding what you're doing and "un-portabilifying" your code when the native byte order matches the file format and (2) it presumes you're working with pickled "file" formats you "stream" in via bytes and not e.g. on memory mapped regions (e.g. network packets!) that want naturally to be inspected/modified in place.
It's fine advice though for the 90% of use cases. The author is correct that people tend to tie themselves into knots needlessly over this stuff.