- When your device is in use in the field, the user will be too hot, too cold, too windy, too dark, too tired, too wet, too rushed, or under fire. Mistakes will be made. Design for that environment. Simplify controls. Make layouts very clear. Military equipment uses connectors which cannot be plugged in wrong, even if you try to force them. That's why. (Former USMC officer.)
- Make it easy to determine what's broken. Self-test features are essential. (USAF officer.)
- If A and B won't interoperate, check the interface specification. Whoever isn't compliant with the spec is wrong. They have to fix their side. If you can't decide who's wrong, the spec is wrong. This reduces interoperability from an O(N^2) problem to an O(N) problem. (DARPA program manager.)
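The O(N^2)-to-O(N) reduction is easy to see in numbers. A minimal sketch (the counts are just arithmetic, not from any real program):

```python
# Testing every pair of N implementations against each other is O(N^2);
# testing each implementation once against the spec is O(N).
def pairwise_tests(n):
    return n * (n - 1) // 2   # every implementation against every other

def spec_tests(n):
    return n                  # every implementation against the spec

for n in (5, 20, 100):
    print(n, pairwise_tests(n), spec_tests(n))
```

At 100 implementations that's 4950 pairwise tests versus 100 conformance tests, which is why the spec gets to be the referee.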
- If the thing doesn't meet spec, have Q/A put a red REJECTED tag on it. The thing goes back, it doesn't get paid for, the supplier gets pounded on by Purchasing and Quality Control, and they get less future business. It's not your job to fix their problem. (This was from an era when DoD customers had more clout with suppliers.)
- There are not "bugs". There are "defects". (HP exec.)
- Let the fighter pilot drive. Just sit back and enjoy the world zooming by. (Navy aviator.)
Aerospace is a world with many hard-ass types, many of whom have been shot at and shot back, have landed a plane in bad weather, or both.
Priceless. I've been trying to make that point for years, but nobody seems to want to listen.
A buggy device is much worse for any critical application - it appears to work under inspection, and even limited testing, but 1% of the time it develops some fatal data race condition that causes it to fail erratically and cause havoc, e.g. the Therac-25.
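A minimal sketch of why a data race survives inspection and limited testing: two threads doing an unlocked read-modify-write on a hypothetical counter, with the race window widened by a sleep so the lost update shows up reliably (in the wild the window is nanoseconds, which is exactly why it only bites 1% of the time):

```python
import threading
import time

counter = 0

def unsafe_increment():
    global counter
    local = counter      # read
    time.sleep(0.1)      # widen the race window for demonstration
    counter = local + 1  # write back, clobbering any concurrent update

threads = [threading.Thread(target=unsafe_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 1, not 2: one update was silently lost
```

Remove the sleep and the same code passes almost every test run, which is the Therac-25 failure mode in miniature.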
So buggy is a subset of defective but it's even worse?
Instrument training in FAA-land requires learners to understand the five hazardous attitudes: anti-authority ("the rules don't apply to me"), impulsivity ("gotta do something now!"), invulnerability ("I can get away with it"), macho ("watch this!"), and resignation ("I can't do anything to stop the inevitable"). Although the stakes are different, they have applicability to software development. Before a situation gets out of hand, the pilot has to recognize and label a particular thought and then think of the antidote, e.g., "the rules are there to keep me safe" for anti-authority.
Part 121 or scheduled airline travel owes its safety record to many layers of redundancy. Two highly trained and experienced pilots are in the cockpit talking to a dispatcher on the ground, for example. They're looking outside and also have Air Traffic Control watching out for them. The author mentioned automation. This is an area where DevSecOps pipelines can add lots of redundancy in a way that leaves machines doing tedious tasks that machines are good at. As in the cockpit, it's important to understand and manage the automation rather than following the magenta line right into cumulogranite.
Remember the importance of checklists in the grand scheme of things. They help maintain proper "authority" during operation and make sure you don't forget things. If you don't write it down and check it, someone will eventually forget something.
Also, the "Aviate, navigate, communicate" axiom (as mentioned by the author) is really helpful if you're trying to set up incident/crisis response structures. You basically get your guiding principles for free from an industry that has 100+ years of experience in dealing with crises. It's something I teach during every incident/crisis response workshop.
edit: Although it's not aviation specific, and a little light on the science, "The Checklist Manifesto" by A. Gawande is a nice introduction into using (and making) checklists.
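The "write it down and check it" discipline translates directly into code. A minimal pre-deploy checklist runner, with every item name purely hypothetical:

```python
# A minimal checklist runner: every item must be explicitly checked off,
# and a forgotten item blocks the run instead of being silently skipped.
CHECKLIST = [
    "tests green on CI",
    "migration applied to staging",
    "rollback plan written down",
    "on-call engineer notified",
]

def run_checklist(completed):
    """Raise unless every checklist item is in the completed set."""
    missing = [item for item in CHECKLIST if item not in completed]
    if missing:
        raise RuntimeError(f"checklist incomplete: {missing}")
    return "cleared for deploy"

# Forgetting an item fails loudly:
# run_checklist({"tests green on CI"})  -> RuntimeError
```

The point is the same as in the cockpit: the list, not anyone's memory, holds the authority.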
Of course in Aviation the 'authorities' are usually rational and fair. In many other areas of life they are neither, and are incompetent to boot. Being anti-authority is justified in such cases. i.e. there is a moral responsibility to disobey unjust laws.
As a new pilot myself, I can say with confidence that the FAA has some major flaws, and the US Congress has gotten its dirty hands into aviation policy, enacting rules that did not originate from NTSB recommendations.
WRT development, I wonder if there are attitudes that can be applied to software and hardware design that combat bad systems.
For example, cars with touchscreens instead of individual controls.
Thank you for recommending the NTSB reports :-)
a major focus has been on encouraging less-experienced members of the flight crew to speak up if they notice something wrong, and ensuring that the more-experienced pilots are open to receiving that feedback instead of adopting an "I'm more senior, I know what I'm doing, don't question it" attitude.
the latter's videos tend to have clickbait-y titles to make the YouTube algorithm happy, but the content is excellent.
0: https://admiralcloudberg.medium.com/
1: https://www.reddit.com/r/AdmiralCloudberg/comments/e6n80m/pl...
[1] https://publications.americanalpineclub.org/about_the_accide...
A few pro-tips:
- Slow is smooth. Smooth is fast.
- Simplicity/speed/efficiency is good.
- Strong is good. Redundancy is good/better.
- Equalization and Extension are a trade off.
- Not everything is a nail so NOT every tool is a hammer.
- Tools that have multiple uses are good.
Now you've jinxed it!
I also recommend Admiral Cloudberg. https://admiralcloudberg.medium.com/drama-in-the-snow-the-cr...
NTSB reports for general aviation tend to focus on individual mistakes since that's most often solo pilots with no ground crew, but for commercial flights it's generally a more complex series of mistakes made in a team.
The hydraulic actuators (rams) have an input and an output port. Connecting the hydraulic lines to the wrong port results in control reversal. To defend against that:
1. One port has left handed threads, the other right handed threads
2. The ports are different sizes
3. The ports are color coded
4. The lines cannot be bent to reach the wrong port
5. Any work on it has to be checked, tested, and signed off by another mechanic
And finally:
6. Part of the preflight checklist is to verify that the control surfaces move the right way
I haven't heard of a control reversal on airliners built this way, but I have heard of it happening in older aircraft after an overhaul.
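The same "can't connect it wrong" idea has a software analogue: give the two ends distinct types so a swapped connection is rejected before it does damage. A sketch in Python (class and function names are illustrative; since Python's type hints aren't enforced at runtime, the check is done explicitly, though a static checker would catch it earlier):

```python
# Distinct types for the two ports, like left- vs right-handed threads:
# the wrong connection is refused instead of silently reversing control.
class InputPort:
    pass

class OutputPort:
    pass

def connect(src, dst):
    """Connect an OutputPort to an InputPort; reject anything else."""
    if not isinstance(src, OutputPort) or not isinstance(dst, InputPort):
        raise TypeError("lines connected to the wrong ports")
    return "connected"

connect(OutputPort(), InputPort())    # fine
# connect(InputPort(), OutputPort()) -> TypeError, like mismatched threads
```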
Safe Systems from Unreliable Parts https://www.digitalmars.com/articles/b39.html
Designing Safe Software Systems Part 2 https://www.digitalmars.com/articles/b40.html
Some time ago I shipped a product running an RTOS which unfortunately had a subtle scheduler bug where it would randomly crash periodically. The bug was pretty rare (I thought), only affecting part of the system, and reproducing the bug took several days each time.
In my infinite genius, rather than waste weeks of valuable time up to release, I set up the watchdog timer on the processor to write a crash dump and silently reboot. A user would maybe see a few seconds of delayed input and everything would come back up shortly.
Unfortunately, I had accidentally set the watchdog clock divider the wrong way, resulting in the watchdog not activating for over 17 hours after a hang!
The bug became much more widely noticeable after the product was released; it was only by sheer luck that many people never noticed it.
I eventually fixed the scheduler bug in an update, but the useless watchdog configuration was set in stone and not fixable. Taught me to never assume a rare bug would stay rare when many tens of thousands of people use something in the field.
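A sketch of how a divider mistake produces that kind of blowup, with entirely hypothetical numbers (a 32.768 kHz watchdog clock and a multiply-where-divide-belonged error), plus the one-line sanity check that would have caught it before release:

```python
# Hypothetical watchdog timing: timeout = reload_count * prescaler / clock_hz.
# Numbers are illustrative, not from any real MCU.
CLOCK_HZ = 32_768  # low-speed watchdog clock

def wdt_timeout_s(reload_count, prescaler):
    return reload_count * prescaler / CLOCK_HZ

intended = wdt_timeout_s(reload_count=4096, prescaler=64)     # 8.0 seconds
botched  = wdt_timeout_s(reload_count=4096, prescaler=2**19)  # 65536 s, ~18 h

# The sanity check that turns a silent misconfiguration into a build failure:
assert wdt_timeout_s(4096, 64) < 30, "watchdog timeout implausibly long"
```

Because the watchdog "worked" either way (it eventually fired), nothing obvious flagged the bad configuration; only an assertion on the derived timeout, not on the raw register value, makes the mistake visible.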
Nit: A is written as Alfa in the NATO alphabet [0] because that spelling makes the pronunciation unambiguous. For the same reason J is written as Juliett (two t's), because in some languages a final t can be silent.
And I hate systems that don't let you say "ignore *this* warning" without turning off all warnings. I have some Tile trackers with dead batteries--but there's no way I can tell the app to ignore *that* dead battery yet tell me about any new ones that are growing weak. (We haven't been using our luggage, why should I replace the batteries until such day as the bags are going to leave the house again?)
I fit this myself: I grew up playing flight simulators, studied computer science as an undergrad, was a military helicopter pilot for a while, and then went to grad school for computer science. Along the way, I've personally met at least half a dozen other academic computer scientists with a pilot's license or military aviation background. Is it just selective attention / frequency illusion for me, or is there more to this?
I bet that a large part of why is that people here tend to have reasonably high incomes, and flying is an expensive hobby. I'm sure that flying would be an incredibly popular hobby across all demographics if it were affordable.
The thing you need to be rich in is free time. You can have all the money in the world but if you don’t have the time to put in, you’re not getting into this hobby.
This is important, but I'm not sure everybody necessarily agrees on what "fail safely" means.
Fail safely can mean one of:
- It doesn't fail silently
- It doesn't cause cascading failures
- It doesn't cause infinite failure loops
- It doesn't fail in ways that corrupt data
- It doesn't fail in ways you lose money
- You can safely retry
- You can safely retry anytime (not just today, or just this month)
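The last two bullets usually come down to idempotence: retrying the same request must not apply the side effect twice. A minimal in-memory sketch (function and key names are hypothetical; real systems persist the key store):

```python
# Idempotent operation: a retry with the same key returns the first result
# instead of repeating the side effect.
processed = {}  # idempotency_key -> result of the first successful attempt

def charge(idempotency_key, amount):
    if idempotency_key in processed:   # retry path: no second side effect
        return processed[idempotency_key]
    result = {"charged": amount}       # the real side effect happens once
    processed[idempotency_key] = result
    return result

first = charge("order-42", 100)
retry = charge("order-42", 100)  # safe to retry: no double charge
assert first is retry
```

"Safely retry anytime" additionally requires that the key store outlive today's cache, which is exactly the "not just today, or just this month" distinction.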
A large passenger aircraft does not solely consist of Level A software. There’s plenty of not-flight-safety-critical software on any airplane you ride as a civilian passenger, but there is some Level A software that could cause the worst consequences if it fails.
Think about what pieces of your software are critical to your company/team’s mission, and which aren’t so bad if they fail. Not every line of code you write, or system you build, will wreak havoc on your company’s primary mission.
Let me give you a simple and easy to understand example: an MP3 decoder performs the boring task of transforming one bunch of numbers into another bunch of numbers. This second bunch of numbers is then fed into a DAC, which in turn feeds into an amplifier. If your software malfunctions it could cause an ear-splitting sound to appear with zero warning while the vehicle that your MP3 decoder has been integrated into is navigating a complex situation. The reaction of the driver can range from complete calm all the way to a panic, including involuntary movements. This in turn can cause loss or damage of property, injury, and ultimately death.
Farfetched? Maybe. But it almost happened to me, all on account of a stupid bug in an MP3 player. Fortunately nothing serious happened but it easily could have.
So most of us should try harder to make good software, because (1) there should be some pride in creating good stuff and (2) you never really know how your software will be used once it leaves your hands so better safe than sorry.
I make video games. _Everything_ in games is a trade-off. There are areas of my code that are bulletproof, well tested, fuzzed, and rock solid. There are parts of it (running in games people play, a lot) that will whiff if you squint too hard at them. Deciding when to employ the second technique is a very powerful skill, and knowing which corners to cut can result in software or experiences that handle the golden path so much better that you decide it's worth the trade-off of cutting said corner.
I'll let you know when I find the right balance.
QA: ”Look, if that integer overflows here, your software is going to fail.” Dev: “Well, it’s a cooking recipe app. Nobody’s gonna die!” How low of an opinion you must have of your own profession if you’re going to excuse yourself this way!
My own contribution is to recommend reading the RISKS Digest:
A lot of the difficulty boils down to an inverse NIH syndrome: we outsource monitoring and alerting … and the systems out there are quite frankly pretty terrible. We struggle with alert routing, because alert routing should really be a function that takes alert data in and figures out what to do with it … but Pagerduty doesn't support that. Datadog (monitoring) struggles (struggles) with sane units, and IME with aliasing. DD will also alert on things that … don't match the alert criteria? (We've still not figured that one out.)
“Aviate, Navigate, Communicate” definitely is a good idea, but let me know if you figure out how to teach people to communicate. Many of my coworkers lack basic Internet etiquette. (And I'm pretty sure "netiquette" died a long time ago.)
The Swiss Cheese model isn't just about having layers to prevent failures. The inverse axiom is where the fun starts: the only failures you see, by definition, are the ones that go through all the holes in the cheese simultaneously. If they didn't, then by definition, a layer of swiss has stopped the outage. That means "how can this be? like n different things would have to be going wrong, all at the same time" isn't really an out in an outage: yes, by definition! All this, of course, assumes you know what holes are in your cheese, and often the cheese has far more holes than people seem to think it does.
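The "n things at once" intuition is worth putting in numbers, under the (unrealistic) assumption that the layers fail independently:

```python
# If each of four defense layers independently misses a fault 1% of the
# time, a fault slips through all of them roughly once per 10^8 attempts.
p_hole = 0.01
layers = 4
p_outage = p_hole ** layers
print(p_outage)  # ~1e-08

# But holes are rarely independent: one bad deploy or shared assumption can
# open the same hole in several layers at once, which is why "all n things
# went wrong simultaneously" is the normal shape of an outage, not a freak
# event.
```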
I'm always going to hard disagree with runbooks, though. Most failures are of the "it's a bug" variety: there is no possible way to write the runbook for them. If you can write a runbook, that means you're aware of the bug: fix the bug, instead. The rest is bugs you're unaware of, and to write a runbook would thus require clairvoyance. (There are limited exceptions: sometimes you cannot fix the bug, e.g. when it lies in a vendor's software and the vendor refuses to do anything about it¹; then you're just screwed, and have to write down the next best workaround, particularly if that workaround is hard to automate.) There are other pressures, like PMs who don't give devs the time to fix bugs, but in general runbooks are a drag on productivity, as they're manual processes you're following in lieu of a working system. Be pragmatic about when you take them on (if you can).
> Have a “Ubiquitous language”
This one, this one is the real gem. I beg of you, please, do this. A solid ontology prevents bugs.
This gets back to the "teach communication" problem, though. I work with devs who seem to derive pleasure from inventing new terms to describe things that already have terms. Communicating with them is a never ending game of grabbing my crystal ball and decoding WTF it is they're talking about.
Also, I know the NATO alphabet (I'm not military/aviation). It is incredibly useful, and takes like 20-40 minutes of attempting to memorize it to get it. It is mind boggling that customer support reps do not learn this, given how shallow the barrier to entry is. (They could probably get away with like, 20 minutes of memorization & then learn the rest just via sink-or-swim.)
(I also have what I call malicious-NATO: "C, as in sea", "Q, as in cue", "I, as in eye", "R, as in are", U, as in "you", "Y, as in why")
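For anyone who wants the 20-minute version, the whole alphabet fits in a lookup table (spellings per the NATO standard, including the Alfa and Juliett forms mentioned above):

```python
# The NATO phonetic alphabet as a lookup table; spell a string for a call.
NATO = {
    "A": "Alfa", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
    "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliett",
    "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
    "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray",
    "Y": "Yankee", "Z": "Zulu",
}

def spell(word):
    """Spell out the letters of a word, skipping non-letters."""
    return " ".join(NATO[c] for c in word.upper() if c in NATO)

print(spell("HN"))  # Hotel November
```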
> Don’t write code when you are tired.
Yeah, don't: https://www.cdc.gov/niosh/emres/longhourstraining/impaired.h...
And yet I regularly encounter orgs or people suggesting that deployments should occur well past the 0.05% BAC equivalent mark. "Unlimited PTO" … until everyone inevitably desires Christmas off and then push comes to shove.
Some of this intertwines with common PM failure modes, too: I have, any number of times, been pressed for time estimates on projects where we don't have a good time estimate because there are too many unknowns in the project. (Typically because whoever is the PM … really hasn't done their job in the first place of having even the foggiest understanding of what's actually involved, inevitably because the PM is non-technical. Having seen a computer is not technical.) When the work is then broken out and estimates assigned to the broken-out form, the total estimate is rejected, because PMs/management don't like the number. Then inevitably a date is chosen at random by management. (And the number of times I've had a Saturday chosen is absurd, too.) And then the deadline is missed. Sometimes, projects skip right to the arbitrary-deadline step, which at least cuts out some pointless debate about, yes, what you're proposing really is that complicated.
That's stressful, PMs.
¹ cough Azure cough excuse me.