Reverse Engineer’s Perspective on the Boeing 787 ‘51 days’ Directive (opens in new tab)

(ioactive.com)

187 pointstrasz3y ago55 comments

55 comments

tjr3y ago

One thing to consider when looking at such things is that commercial avionics software systems are full of known limitations. I do not know if this particular 51-day limitation was intentional or not, but in general:

Avionics software starts with writing comprehensive requirements. When the software itself is developed based on those requirements, it is then tested against the requirements, always in a real functioning airplane, but also often in smaller airplane-cockpit-like rigs and in purely simulated environments.

Nobody is going to write a requirement that says "this avionics subsystem will function without error forever". Even if you thought you could make it happen, you can't test it. So there are going to be boundaries. You might say that the subsystem will function for X days. What happens after that? It may well run just fine for X+1 days, or 2X days, or 100X days. But it's only required to run for X days, and it's only tested and certified for running for X days.

I could easily imagine that this particular subsystem was required and certified for some value of X <=51 days, and it just so happened that if the subsystem ran for over 51 days then it started to fail. Or, it could have been a genuine mistake.

But even if the intended X wasn't 51 days, there almost certainly was some intended, finite value for X. We might say, "well, my laptop has run for three years without needing a reboot". Great! Is that a guaranteed, repeatable state of operation that the FAA would certify? Probably not. And besides that, do we really want to have to endure a three-year verification test?

In most software, we are happy to say, "it should run indefinitely". For avionics software, that's insufficient. We instead say "it will run at least for some specific predetermined finite amount of time" and then back up that statement with certifiable evidence.

Aloha3y ago

I work in a field that operates under similar development constraints. (Namely it's a mature product in a mature field with well defined requirements) Because if this I regularly get calls from my customers wondering why their system can't do X or Y in the B way instead of the A way, and I have a similar conversation. Wherein I have to explain "no, that wasn't part of your requirements 5 years ago, if you want to change it, you'll need to pay us for more development", that normally eliminates the requirement for whatever it was they wanted pretty quickly.

Also, uptime is a factor, I've seen what windows looks like when it runs out of GDI objects, it's strange. But once you see it, you can explain to the customer the importance of regular reboot/restarts.

hef198983y ago

I never understood why regular, and scheduled, reboots are concidered to be a problem to begin with.

ggm3y ago

It can come with exposure of hidden costs. So a pc which can only be assured to be correct by reboot cannot continuously monitor a flow process which cannot be interrupted for that reboot window. It has to be designed to work with two, or some kind of data buffering has to be designed in, or the specification changed to redefine to continuous(*)

Which btw is what should be done but.. it can cause rage

[*] may not be continuous or complete in all circumstances

1 more reply

xattt3y ago

I worked in healthcare where our EMR went into downtime for two hours on daylight transition days. It was extremely disruptive as we had to switch to a paper process for that time period that needed to get reconciled with the EMR at the end of the shift.

Unless you have a dedicated team doing that, preventative reboots and various “workarounds” sound great on paper for administrators but make for a shitty experience for people doing the actual work.

1 more reply

FPGAhacker3y ago

Sounds about right. But it’s still a critical failure for a fault of any kind to ever display incorrect information to the pilot.

phkahler3y ago

And in this case it seems one function of the software is interfering with another, which causes the incorrect display.

inferiorhuman3y ago

  I do not know if this particular 51-day limitation was intentional or no

I highly doubt it was intentional. Boeing's already had to issue an AD for similar behavior on the 787:

https://www.engadget.com/2015-05-01-boeing-787-dreamliner-so...

If they knew about it there'd be no need for an AD. Boeing tried to become the aviation equivalent of a fabless chip designer with the 787 and it didn't go well at all. Turns out they had little-to-no experience managing external development and manufacturing teams. I don't know anything about the 51-day bug, but the 248-day bug caused critical failures that you really wouldn't want happening in flight.

xattt3y ago

> Nobody is going to write a requirement that says "this avionics subsystem will function without error forever".

These time limits could at least be pegged to real-life intervals to when the system is going to be shut down anyway. If the system continues to be operated past that point, skipped maintenance intervals could be underlined as the cause.

trenchgun3y ago

It is on fact possible to write provably correct software for safety critical applications.

Not by testing, but by using formal methods.

dotancohen3y ago

That's nice for the software. Now how about the hardware? How about the electronic hardware's not-exposed firmware, does that count? Did the subcontractor test it for three years at 10,000 feet for radiation-induced bit-flips? With or without lightning strikes?

mschuster913y ago

> How about the electronic hardware's not-exposed firmware, does that count? Did the subcontractor test it for three years at 10,000 feet for radiation-induced bit-flips? With or without lightning strikes?

Blast the module in a radiation chamber. It can be done, it's only extremely expensive - the military has the budget (makes sense, given that a fighter jet or a bomber should be able to power through a nuclear bomb fallout), but civilian airliners are all about cost efficiency.

hef198983y ago

Including a system reboot, on the ground, as part of your on going maintenance activities is a fault, or incorrect software.

prennert3y ago

Is a roof you have to redo every 20 years, or a paint that only lasts 10 years faulty? Is a car that needs brakes replaced every X thousand kilometers faulty?

It is only faulty if it does not run according to spec, or if you it run outside the spec.

1 more reply

userbinator3y ago

For example, let’s imagine that the timestamp set by the transmitting ES is close to its wrap-around value. After performing the required calculation, the receiving ES obtains a timestamp that has already wrapped-around, so it would look like the message had been received before it was actually sent.

Isn't it surprising that modulo arithmetic, as already employed successfully in TCP sequence numbers and the like, still seems to be incorrectly implemented today? What's more disappointing is seeing all the other incredible systemic complexity they've added, and yet the plane appears to have no mechanical backup instruments?

steffan3y ago

To address the second part:

> and yet the plane appears to have no mechanical backup instruments[?]

This is unlikely in a modern aircraft because mechanical instruments to back up e.g., the artificial horizon / attitude indicator or directional gyro (DG) / heading indicator are:

1) Mechanically complex - the attitude indicator and DG make use of gyroscopes which rotate at up to 24,000 RPM along with other mechanisms. They are typically powered by vacuum or electric motors which consume relatively more power (or require vacuum lines and a vacuum pump)

2) Expensive to maintain - see (1) - they need to be serviced somewhat regularly

(3) Heavier than their solid-state counterparts

(4) Have [dramatically] different failure modes - instead of a display going dark, a DG will slowly drift as the gyroscope precesses, giving erroneous values. Same with the artificial horizon. This can lead to catastrophic results under instrument meteorologicalconditions (IMC) where the pilots rely solely on instruments to maintain essential things such as heading and level flight.

(5) Because of (4) they require additional redundancy to ensure instruments can be cross-checked with one another. This compounds (2) and (3)

caconym_3y ago

I think you are overstating the impracticality of mechanical standby instruments. Even glass cockpit GA aircraft typically came with fully mechanical backups until fairly recently---check out this SR22 cockpit as an example: https://commons.wikimedia.org/wiki/File:SR22TN_Perspective_C...

"Glass" standby instruments come with significant upside and not much downside, which is why they've been preferred in larger/more expensive aircraft for a while. There is nothing inherently more or less reliable about them, being fully isolated and redundant just as old-timey mechanical backups are, and they offer a much richer presentation (typically like a small PFD). However, new things are usually more expensive, which IIUC is why they were adopted first in larger, more expensive aircraft. They were considered a luxury in GA until fairly recently.

TillE3y ago

Plus the pilot stress of having to adjust to using dramatically different instruments when already in a difficult situation.

It's just not a workable idea in general. There are checklists for stuff like instrument failure which can probably recover from a software bug like this.

caconym_3y ago

It's absolutely a workable idea. Standby instruments are typically a requirement for glass cockpit aircraft, and before electronic standby instruments came onto the scene mechanical instruments were used in the standby role in (AFAIK) all sectors of aviation.

"Fly the airplane" is the highest priority in any aviation emergency, and in many emergencies you will need backup instruments to do so. I don't mean to be mean, but tbh it is a little absurd to suggest that a e.g. a pilot who loses her PFD in IMC is better off running checklists than using backup instruments to establish control of the aircraft and situational awareness, and bailing out asap. Sure, it's stressful, but it's also something pilots need to (and do) train for.

Once the aircraft is under control, you can run your checklists, or if you have a co-pilot you may be able to work in parallel. Maybe you will be able to fix the issue, and maybe you won't, but backups give you a shot at landing safely either way.

sneak3y ago

I think backups for the electronic systems would not need the same level of redundancy as the primary systems (which presumably already have backups).

It's sort of like how you don't need RAID for your offsite backup disks, just some parity for bit-rot.

The mechanical instruments would be the (additional) redundancy. The additional weight/lines/service is indeed burdensome even without redundant mechanical systems.

mlindner3y ago

> I think backups for the electronic systems would not need the same level of redundancy as the primary systems (which presumably already have backups).

If your backup is failing more often than your primary system then it's not much use as a backup.

Also, there ARE backups. There's fallback artificial horizon boxes that work independently of the rest of the system, for example.

Teongot3y ago

> Isn't it surprising that modulo arithmetic, as already employed successfully in TCP sequence numbers and the like, still seems to be incorrectly implemented today

Even in TCP sequence numbers, it can be implemented incorrectly.

https://engineering.skroutz.gr/blog/uncovering-a-24-year-old...

rootusrootus3y ago

Fascinating analysis. I know planes get used a lot, but I'm surprised that they go for such a long time without ever being powered down.

Aperocky3y ago

51 days seems to be approximately how often my mac dies in kernel panic or starting to be bugged by persistent software problems that go away with a restart.

dcormier3y ago

I’m at 356 days of uptime on my MacBook Pro. ¯\_(ツ)_/¯

assttoasstmgr3y ago

So you never install security updates? Because all Apple updates require a reboot due to their SIP "update the frozen image offline" nonsense.

1 more reply

Waterluvian3y ago

Awww ya jinxed it. And only nine days from retirement.

rootusrootus3y ago

Sadly I'm at 2 days, 12:41 myself. I don't get many kernel panics, but this most recent reboot was in fact a panic, coincidentally. Googled the error and it came up as something that happens with M1 Mac Minis while they're sleeping. But while my machine has a M1, it is an MBP and not a Mini. And it was not sleeping. Ah well.

mnw21cam3y ago

I just rebooted my EeePC. It had an uptime of 5.8 years. I only rebooted it to upgrade from Debian 9 to Debian 10, and I'll bump it up to Debian 11 later on. It has a broken screen, so it just sits on top of a cupboard with a couple of 4TB USB hard drives plugged into it, storing all my backups.

Moru3y ago

Last week I had to shut down my linux box for a move: up 3457 days. One day too late I guess :-)

JoshGlazebrook3y ago

But you typed this message 7 days ago.

pixelfarmer3y ago

I remember articles of the Airbus A350 requiring reboots every N days (150ish or so?). I remember the Patriot missile system required a reboot every 24 hours or so until they fixed the software defect which caused the time counting to drift. And I'm pretty sure there are many more such cases where devices fail if kept on for too long, even in spaces where you are supposed to fill out a lot of "paper"work + jump through a lot of defined processes like in avionics, medical, or automotive field, among a good few others (safety and all that).

Yizahi3y ago

We had a bug years ago that after 50 days of uptime all network sessions dropped on our devices. Apparently it was a session timer overflow in a variable. I think it was unsigned int and time was in milliseconds.

mormegil3y ago

149 hours, see https://en.wikipedia.org/wiki/List_of_software_bugs#Transpor...

rcyeh3y ago

tl;dr: 51 days is the wraparound point of a signed 6-byte counter running at 33 MHz, used to invalidate stale data from instruments.

saratogacx3y ago

When I saw 51 days my first thought is it had to be a time rollover. Mainly because of this bug from long ago and how close the time spans are.

https://www.cnet.com/culture/windows-may-crash-after-49-7-da...

drewrv3y ago

This assumes there is no margin of error baked into the 51 day rule, which surprises me.

lamontcg3y ago

This communication is not Boeing communicating a required maintenance interval, they're communicating a problem. It wouldn't seem natural for me for Boeing to add a random hidden margin in a problem description. When it comes to the maintenance remedy, I don't know if Boeing would do this or airlines or the FAA. Presumably the mandatory maintenance reboot interval will be much shorter than 51 days.

nimish3y ago

2^47/(32MHz) ~= 50.9 days

Not much of a margin there.

zokier3y ago

2^47/33e6 = 49.36 days. The value is so much off that makes me suspect that this is not the correct analysis, or at least that there are additional factors at play

1 more reply

taneq3y ago

I feel like even 51 minutes might be too long to wait before invalidating stale instrument data on an aeroplane...

Gibbon13y ago

All I have to say is if my firmware barf's after being up for 8.919 million years I won't care.

junar3y ago

Please add (2020) to the title.

kreelman3y ago

This is a really good analysis of the issue from the just the verbage from FAA. Well done.

acdanger3y ago

Reminds me of the LAX Air Traffic Control Shutdown of 2004: https://m.slashdot.org/story/49885

newsclues3y ago

Why?

Was it a cost issue?

Or was there an expectation that a regular maintenance check would occur within this time frame that involved a reboot as part of the maintenance check for diagnostics?

Taniwha3y ago

51 days is slightly more than 2^32 milliseconds?

kelnos3y ago

If that were the issue, then they'd have to reboot it every 49.7 days, no? Waiting 51 days would trigger the problem they're trying to avoid.

j / k navigate · click thread line to collapse

55 comments

tjr3y ago

Aloha3y ago

hef198983y ago

I never understood why regular, and scheduled, reboots are concidered to be a problem to begin with.

ggm3y ago

Which btw is what should be done but.. it can cause rage

[*] may not be continuous or complete in all circumstances

1 more reply

xattt3y ago

1 more reply

FPGAhacker3y ago

Sounds about right. But it’s still a critical failure for a fault of any kind to ever display incorrect information to the pilot.

phkahler3y ago

And in this case it seems one function of the software is interfering with another, which causes the incorrect display.

inferiorhuman3y ago

  I do not know if this particular 51-day limitation was intentional or no

I highly doubt it was intentional. Boeing's already had to issue an AD for similar behavior on the 787:

https://www.engadget.com/2015-05-01-boeing-787-dreamliner-so...

xattt3y ago

> Nobody is going to write a requirement that says "this avionics subsystem will function without error forever".

trenchgun3y ago

It is on fact possible to write provably correct software for safety critical applications.

Not by testing, but by using formal methods.

dotancohen3y ago

mschuster913y ago

hef198983y ago

Including a system reboot, on the ground, as part of your on going maintenance activities is a fault, or incorrect software.

prennert3y ago

Is a roof you have to redo every 20 years, or a paint that only lasts 10 years faulty? Is a car that needs brakes replaced every X thousand kilometers faulty?

It is only faulty if it does not run according to spec, or if you it run outside the spec.

1 more reply

userbinator3y ago

steffan3y ago

To address the second part:

> and yet the plane appears to have no mechanical backup instruments[?]

This is unlikely in a modern aircraft because mechanical instruments to back up e.g., the artificial horizon / attitude indicator or directional gyro (DG) / heading indicator are:

2) Expensive to maintain - see (1) - they need to be serviced somewhat regularly

(3) Heavier than their solid-state counterparts

(5) Because of (4) they require additional redundancy to ensure instruments can be cross-checked with one another. This compounds (2) and (3)

caconym_3y ago

TillE3y ago

Plus the pilot stress of having to adjust to using dramatically different instruments when already in a difficult situation.

It's just not a workable idea in general. There are checklists for stuff like instrument failure which can probably recover from a software bug like this.

caconym_3y ago

sneak3y ago

I think backups for the electronic systems would not need the same level of redundancy as the primary systems (which presumably already have backups).

It's sort of like how you don't need RAID for your offsite backup disks, just some parity for bit-rot.

The mechanical instruments would be the (additional) redundancy. The additional weight/lines/service is indeed burdensome even without redundant mechanical systems.

mlindner3y ago

> I think backups for the electronic systems would not need the same level of redundancy as the primary systems (which presumably already have backups).

If your backup is failing more often than your primary system then it's not much use as a backup.

Also, there ARE backups. There's fallback artificial horizon boxes that work independently of the rest of the system, for example.

Teongot3y ago

> Isn't it surprising that modulo arithmetic, as already employed successfully in TCP sequence numbers and the like, still seems to be incorrectly implemented today

Even in TCP sequence numbers, it can be implemented incorrectly.

https://engineering.skroutz.gr/blog/uncovering-a-24-year-old...

rootusrootus3y ago

Fascinating analysis. I know planes get used a lot, but I'm surprised that they go for such a long time without ever being powered down.

Aperocky3y ago

51 days seems to be approximately how often my mac dies in kernel panic or starting to be bugged by persistent software problems that go away with a restart.

dcormier3y ago

I’m at 356 days of uptime on my MacBook Pro. ¯\_(ツ)_/¯

assttoasstmgr3y ago

So you never install security updates? Because all Apple updates require a reboot due to their SIP "update the frozen image offline" nonsense.

1 more reply

Waterluvian3y ago

Awww ya jinxed it. And only nine days from retirement.

rootusrootus3y ago

mnw21cam3y ago

Moru3y ago

Last week I had to shut down my linux box for a move: up 3457 days. One day too late I guess :-)

JoshGlazebrook3y ago

But you typed this message 7 days ago.

pixelfarmer3y ago

Yizahi3y ago

mormegil3y ago

149 hours, see https://en.wikipedia.org/wiki/List_of_software_bugs#Transpor...

rcyeh3y ago

tl;dr: 51 days is the wraparound point of a signed 6-byte counter running at 33 MHz, used to invalidate stale data from instruments.

saratogacx3y ago

When I saw 51 days my first thought is it had to be a time rollover. Mainly because of this bug from long ago and how close the time spans are.

https://www.cnet.com/culture/windows-may-crash-after-49-7-da...

drewrv3y ago

This assumes there is no margin of error baked into the 51 day rule, which surprises me.

lamontcg3y ago

nimish3y ago

2^47/(32MHz) ~= 50.9 days

Not much of a margin there.

zokier3y ago

2^47/33e6 = 49.36 days. The value is so much off that makes me suspect that this is not the correct analysis, or at least that there are additional factors at play

1 more reply

taneq3y ago

I feel like even 51 minutes might be too long to wait before invalidating stale instrument data on an aeroplane...

Gibbon13y ago

All I have to say is if my firmware barf's after being up for 8.919 million years I won't care.

junar3y ago

Please add (2020) to the title.

kreelman3y ago

This is a really good analysis of the issue from the just the verbage from FAA. Well done.

acdanger3y ago

Reminds me of the LAX Air Traffic Control Shutdown of 2004: https://m.slashdot.org/story/49885

newsclues3y ago

Why?

Was it a cost issue?

Or was there an expectation that a regular maintenance check would occur within this time frame that involved a reboot as part of the maintenance check for diagnostics?

Taniwha3y ago

51 days is slightly more than 2^32 milliseconds?

kelnos3y ago

If that were the issue, then they'd have to reboot it every 49.7 days, no? Waiting 51 days would trigger the problem they're trying to avoid.

j / k navigate · click thread line to collapse