The hiring pitch isn't in your face, but there's a "We're hiring!" button in the banner, which follows you fairly unobtrusively down the page, and the post then ends with a "hey, if you're interested in working with us, reach out." Overall, it just feels really well done.
Networks are tricky to run and networking is proper hard to do. TCP/UDP et al. are pretty bloody good at shuffling data from A to B. I find it quite amusing when 20 years is considered old for a bug.
The Millennium bridge in London is a classic example of forgetting the basics - in this case resonance - and being too clever for your own good. It's a rather cool design for a bridge - a sort of suspension bridge but flatter, with some funky longitudinal stuff. I'm a Civ Eng grad. It looked too flat to me from day one.
When people walk across a bridge and it starts to sway, they start to lock step, and then resonance - where each step reinforces the last - kicks in, and more and more energy causes sway, shear and what-have-you forces. It gets worse and worse and then failure. Tacoma Narrows is another classic example of resonance, but due to wind - that informed designs that don't fly!
Civ Eng is way, way older than IT and we are still learning. 24 years is nothing for a bug. However, IT is capable of looking inward and monitoring itself (unit tests, ping etc) in a way that Civ Eng can't (OK we have strain gauges and a few other tools).
The real difference between physical stuff and IT is that the Milli bridge rather obviously came close to failure visually and in a way that our other senses can perceive - it shook. The fix was to put hydraulic dampers along its length.
In IT, we often try to fix things by using magic or papering over flaws with "just so" stories. Sometimes we get the tools out and do the job properly and these boys and girls did just that: the job properly.
This anecdote reminds me of a story from ancient Rome. (I don't know if this is actual history or a myth.)
Apparently, when Roman military engineers built a bridge, they were forced to stand beneath it while the rest of the cohort marched across the bridge to test its strength.
Marching produces exactly this same resonance effect.
"My company has had a safety program for 150 years. The program was instituted as a result of a French law requiring an explosives manufacturer to live on the premises with his family." - Crawford Greenewalt
But this also explains a bit why rsync is "hard on networks". Most bulk data transfers end up with breaks in the data that give more breathing room to other protocols. Not rsync, it tries as hard as it can to keep the pipe full 100% of the time, making it hard for other TCP slow starters to get a foothold.
As I understand it, a significant factor in triggering this bug is that you're sending tons of data, but in a way that's limited by the source.
Is there any truth to this? I find it hard to believe -- most of the time rsync is tunneled over ssh, which seems well enough abstracted from an optimal traffic-generation mechanism that I would seriously doubt it's able to outcompete other programs for network resources in a meaningful way. Perhaps this observation evolved because a lot of networks have traffic-shaping rules for ssh? The unfortunate combination of traffic-shaping rules for ssh + a low-bandwidth connection + rsyncs happening over ssh + an administrator logged in over ssh via the low-bandwidth link could maybe produce this observed (but nonsensical?) correlation?
This writeup represents the depths that an engineer has to go to get real work done. I'm familiar with the integer wraparound comparison issue, and all of the other errata around TCP windowing. Thankfully countless people have done this work and we're able to enjoy the fruits of their labor today.
Not sure where I'm going with this, but I've been programming for 30 years, and to this day, I view kernel developers and the people who isolate these bugs as the very best among us.
I once spent a week troubleshooting a firewall at a customer's site who had a similar issue with zero-length TCP window PDUs.
The firewalls the customers used also didn't allow a change in this behaviour. Luckily they were able to solve this in their software, but still, these kinds of things should be configurable in a networking product.
There are very few engineers who seem to understand the details of TCP, especially its more obscure aspects.
NFS's failure mode of freezing up your system and requiring a full reboot to clear is purestrain NFS though. I never understood why the idea of an eventual soft failure (returning a socket error) was considered unacceptable in NFS land.
Problems like this are usually the result of being unable to decide on an appropriate timeout, so no timeout is chosen. To get beyond that, I like to suggest rather long timeouts - one day, or one week - rather than forever. Very few people are going to say, after a read tried for a whole day, that it should have tried longer.
Another issue is that POSIX file I/O doesn't have great error indicators, so it can be tricky to plumb things through in clearly correct ways.
I had never heard the name before, and I felt the article lacked some context. Googling it, there seems to be very little content about them in English, which makes the nice blog post almost surprising. :)