The hiring pitch isn't in your face, but there's a "We're hiring!" button in the banner, which follows you fairly unobtrusively down the page, and the post then ends with a "hey, if you're interested in working with us, reach out." Overall, it just feels really well done.
Networks are tricky to run and networking is proper hard to do. TCP/UDP et al. are pretty bloody good at shuffling data from A to B. I find it quite amusing when 20 years is considered old for a bug.
The Millennium bridge in London is a classic example of forgetting the basics - in this case resonance - and being too clever for your own good. It's a rather cool design for a bridge - a sort of suspension bridge but flatter, with some funky longitudinal stuff. I'm a Civ Eng grad. It looked too flat to me from day one.
When people walk across a bridge and it starts to sway, they start to lock step, and then resonance - where each step reinforces the last - kicks in, and more and more energy causes sway, shear and what-have-you forces. It gets worse and worse and then failure. Tacoma Narrows is another classic example of resonance, but due to wind - that informed designs that don't fly!
Civ Eng is way, way older than IT and we are still learning. 24 years is nothing for a bug. However, IT is capable of looking inward and monitoring itself (unit tests, ping etc) in a way that Civ Eng can't (OK we have strain gauges and a few other tools).
The real difference between physical stuff and IT is that the Milli bridge rather obviously came close to failure visually and in a way that our other senses can perceive - it shook. The fix was to put hydraulic dampers along its length.
In IT, we often try to fix things by using magic or papering over flaws with "just so" stories. Sometimes we get the tools out and do the job properly and these boys and girls did just that: the job properly.
This anecdote reminds me of a story from ancient Rome. (I don't know if this is actual history or a myth.)
Apparently, when Roman military engineers built a bridge, they were forced to stand beneath it while the rest of the cohort marched across the bridge to test its strength.
Marching produces exactly this same resonance effect.
"My company has had a safety program for 150 years. The program was instituted as a result of a French law requiring an explosives manufacturer to live on the premises with his family." - Crawford Greenewalt
But this also explains a bit why rsync is "hard on networks". Most bulk data transfers end up with breaks in the data that give more breathing room to other protocols. Not rsync, it tries as hard as it can to keep the pipe full 100% of the time, making it hard for other TCP slow starters to get a foothold.
As I understand it, a significant factor in triggering this bug is that you're sending tons of data, but in a way that's limited by the source.
Is there any truth to this? I find it hard to believe -- most of the time rsync is tunneled over ssh, which seems well enough abstracted from an optimal traffic-generation mechanism that I would seriously doubt it's able to outcompete other programs for network resources in a meaningful way. Perhaps this observation evolved because a lot of networks have traffic-shaping rules for ssh? The unfortunate combination of traffic-shaping rules for ssh + a low-bandwidth connection + rsyncs happening over ssh + an administrator logged in over ssh via the low-bandwidth link could maybe produce this observed (but nonsensical?) correlation?
This writeup represents the depths that an engineer has to go to get real work done. I'm familiar with the integer wraparound comparison issue, and all of the other errata around TCP windowing. Thankfully countless people have done this work and we're able to enjoy the fruits of their labor today.
Not sure where I'm going with this, but I've been programming for 30 years, and to this day, I view kernel developers and the people who isolate these bugs as the very best among us.
I once spent a week troubleshooting a firewall at a customer's site who had a similar issue with zero-length TCP window PDUs.
The firewalls the customers used also didn't allow a change in this behaviour. Luckily they were able to solve this in their software, but still, these kinds of things should be configurable in a networking product.
There are very few engineers who seem to understand the details of TCP, especially its more obscure aspects.
NFS's failure mode of freezing up your system and requiring a full reboot to clear is purestrain NFS though. I never understood why the idea of an eventual soft failure (returning a socket error) was considered unacceptable in NFS land.
Problems like this are usually the result of being unable to decide on an appropriate timeout, so no timeout is chosen. To get beyond that, I like to suggest rather long timeouts - one day, or one week - rather than forever. Very few people are going to say, after a read tried for a whole day, that it should have tried longer.
Another issue is that POSIX file I/O doesn't have great error indicators, so it can be tricky to plumb things through in clearly correct ways.
I had never heard the name before, and I felt the article lacked some context. Googling it, there seems to be very little content about them in English, which makes the nice blog post almost surprising. :)