I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.
My first boss accidentally deleted our QA database, meaning to delete a local copy
A later boss accidentally deleted our production database, thinking it was the clone that he had just made (which luckily we still had)
Both of them were very experienced developers in their 40s. Nobody is beyond this kind of mistake.
Anyway, this being Linux, everyone's home directory was mounted on NFS. All our builds were standardized with a tool called SystemImager, which we could use to push out updates to everyone's desktop whenever we wanted. If there was a new version of KDE, we could pretty easily push that change out.
Sometimes it was convenient for me to work on updates to these images by chrooting into a directory containing the "image," which was really just an rsync tree. And sometimes, when updating these images, it was convenient to mount our NFS home directories in this chroot environment, so I could access things like an archive I had just downloaded on my own desktop.
And eventually we had lots of different images, and the old ones were using up a lot of disk space, so I decided to clean up some space by removing the old images. And these are fairly large images, with lots of small files, and this was before SSDs were a thing, so it made sense that deleting them was taking a while, and I stepped out to grab something to eat.
As I was eating lunch, I started getting the tech support escalations. But this wasn't that unusual, our users routinely had problems with the environment we had provided. They hated it, because it was in many ways terrible, and they made sure we knew it. So I wasn't terribly alarmed. I didn't think any major changes had been made, so I didn't hurry back.
By the time I leisurely returned from lunch, half the NFS home directories for our users were gone, along with all their documents, emails, bookmarks, or whatever else. Suddenly it hit me what had happened: at some point, perhaps months earlier, I had left our NFS home directories mounted within one of these image chroots. And now I had sudo rm -rf'd it.
We had backups, but they were on tape, and it took several days to restore, with about a day of data loss.
I'd say they were experienced developers. Only after accidentally deleting databases were they very experienced developers.
Good news is that I was deleting the test database to ensure that the recovery from backups was properly automated, so it wasn't down too long.
Consider network partitioning so dev/test/accept just has 0 contact with prod.
The knobs are labeled with a terrible little glyph meant to indicate which is which, and I've supplemented this with plain-english Brady labels "front left", "front right", etc. Now I speak the words above the knob, and point to the burner. It felt goofy at first, but now it feels normal, and like I'm tempting fate if I skip it.
I've never seen a different arrangement.
Unfortunately, it strikes many as looking rather silly, so it hasn't been widely adopted.
I do this whenever I'm on a production server (which is rare anyway). I use different colored prompts for local and remote shells.
[0] Technically he had no beard and if he had, it wouldn't have been gray.
It's also not perfect; it does not catch mistakes concerning "non-local" state, e.g. configuration files in /etc merging with one in . merging with some command line options. (Personally I try to avoid writing tools with defaults of this sort, but especially Java developers seem to have different opinions.)
Unfortunately if you do P&C and still make the mistake due to the aforementioned tooling, you look even stupider.
Yeah, ouch. More ouch if it's the other way around- you delete the test database and it's not the test database.
(long story)
> (long story)
I think you can skip the long story, as most of us can tell a story similar in theme if not specifics (and sometimes, probably some similar specifics too). ;)
With great power comes great responsibility (to not completely screw stuff up because you were on autopilot for a second...)
The concept makes sense, though I don't quite fully get how to translate it to other contexts besides train driving where unexpected and unpredictable events come up all the time. Let's say you're driving a car and the traffic light turns red. Do you point at the traffic light, say "red", point at your brake pedal, say "brakes", and then hit the brakes?
Getting out of your car, pressing the lock button on the inside of the driver's side door, and shutting the door are all routine, boring actions that make it easy to forget your keys inside the car. The keys can go in all kinds of places as you climb out of the car - jacket pocket, pants pocket, center console. It is very easy to lock your keys in your car.
I quickly learned to hold my keys in one hand, say out loud, "Keys in hand," and then lock the door with the other hand.
This technique is perfect for any repetitive action that could go wrong with non-trivial consequences, and there's lots of that in everyday life.
Traffic lights are a lot more random (and therefore mentally engaging) than the types of things train conductors are pointing and calling.
An automotive equivalent of a situation that would benefit from pointing and calling is something like this: https://www.consumerreports.org/car-safety/guide-to-rear-sea...
eg.: "Car parked, ignition off, get child"
https://www.youtube.com/watch?v=afjPmN0GT04
Green signals are pointed at at 2:58 and 3:29.
This idea is more useful for situations that you are initiating, and where feedback is not immediately obvious.
An example could be turning your car’s lights on at night. Before starting the car, you force yourself to point to the switch, say “lights on”, and do it.
I use this with keys. When leaving my office, house, or car, I hold up the key in my hand and establish sight (I don’t say anything out loud). Then I lock the door.
a good example from normal life is (physical) key management. I used to always forget my keys when walking out the front door, which was a big problem since it locks automatically. to solve the problem, I made my back right pocket be the designated "key pocket". I now slap my right butt cheek whenever I leave a building. it might look weird to observers, but I have not once forgotten my keys since I implemented this system.
It may seem silly, but if we asked people who drive 30+ minutes every day if they have ever accidentally run a stop sign or red light, I suspect the numbers would be quite high (though it likely happens at times/places where the chance of an accident is smallest, such as empty roads late at night).
As others have pointed out, this is for repetitive tasks that your brain wants to automate away, but you really want to keep in attention.
E.g. force yourself to read the “production” part of your prompt before running the command. Point at the user name before deleting its record. Read aloud the version name before sending it to deploy.
It really makes a difference between just glancing at the info and having to parse it as part of an action.
Now you can have the request and database administration tool open and point and call at the numbers and any queries and make sure you are deleting the right users.
I love finding out that this stuff works.
const HARD_CODE_TEST_DATABASE_FOR_SAFETY = 'unit-testing'
destroyDatabase(HARD_CODE_TEST_DATABASE_FOR_SAFETY)
1. Avoid silly terms our industry should have ditched years ago, like 'drop'
2. Making sure that nobody will ever change HARD_CODE_TEST_DATABASE_FOR_SAFETY because they thought it should 'always be the active database' or whatever.
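For what it's worth, here is a minimal Python sketch of the same idea, assuming the destructive call is passed in as a function. Every name here (TEST_DATABASE_NAME, RefusedError, destroy_test_database) is made up for illustration; the point is only that the name is a literal in the code and the helper refuses anything else.

```python
# Hypothetical names throughout; a sketch, not anyone's real tooling.
TEST_DATABASE_NAME = "unit-testing"  # deliberately a literal, never read from config


class RefusedError(Exception):
    """Raised when asked to destroy anything other than the test database."""


def destroy_test_database(name, destroy):
    """Call destroy(name) only if name is the hard-coded test database."""
    if name != TEST_DATABASE_NAME:
        raise RefusedError(
            f"refusing to destroy {name!r}: not {TEST_DATABASE_NAME!r}"
        )
    destroy(name)
```

Even if someone later passes in "the active database", the helper raises instead of destroying it.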
I have noticed, since learning to cook at a professional level in the kitchen, that I point and call out a lot more in my other activities too. "From hot behind" and "knife" and "oven is over temp" to "Saw blade is live" and "circuit is live" in the workshop to "production server" and "erasing records" in database maintenance. Some days I feel like Sigourney "I have one job damnit" Weaver in Galaxyquest. It's a useful stop-think-go sanity check.
Explained here: https://www.nydailynews.com/new-york/mta-conductors-point-st...
It also helps when Z results in a total meltdown and you need to pull in more people to help out, so they have context of what happened.
- I am... (who you are and where you are)
- I see... (describe what you see in simple non-ambiguous terms)
- I do... (what action you are taking now)
- I ask... (ask for reinforcements if necessary, you may be asked to justify yourself more)
That is an interesting way of looking at it.
I think a router analogy might be more precise - more like fast path / slow path - where when most packets come in they hit the fast path in hardware, and slow path exception packets get handled by the cpu.
:)
This is not a criticism of bureaucracy or regulation BTW (I'm a fan of both, in general). It's simply a recognition that there's a misalignment of objectives.
Not sure how to analyze the calculus in the case of rachaelbythebay's observation. Certainly there is one misalignment which is if the tool has sharp unprotected edges (e.g. can take the company's whole site down) the person who ran the program will be blamed, not the person who wrote it. Unless they are the same person, it's hard to get a proper feedback loop in place. The only tools we have are coding standard and code reviews: bureaucracy!
https://digital.gov/resources/paperwork-reduction-act-44-u-s...
It requires the Office of Management and Budget to calculate the impact of record-keeping requirements on time and privacy, among other things.
I do not believe it has resulted in a reduced recordskeeping burden. For the most part I simply see an estimate of how long it will take to complete my tax forms and permits, on the form itself. Perhaps others have different views.
Something like: ./dangerous-script.sh $args | bash
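A rough Python sketch of that shape, assuming the dangerous part is a cleanup of some paths: the function only builds the command lines and never runs them, so a human can read the output before piping it to a shell. The name plan_cleanup and the choice of rm are mine, purely for illustration.

```python
import shlex


def plan_cleanup(paths):
    """Return the shell commands that would delete each path, without running them."""
    # shlex.quote keeps paths with spaces or shell metacharacters intact
    # when the printed lines are later fed to a shell.
    return [f"rm -rf -- {shlex.quote(p)}" for p in paths]
```

Print the returned lines, read them, and only then pipe the same output into bash.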
The following prefix in a ps1 script enables the -WhatIf and -Confirm parameters:
[CmdletBinding(SupportsShouldProcess=$true)]
To enable -Confirm by default for scary scripts, just use: [CmdletBinding(SupportsShouldProcess=$true,ConfirmImpact='High')]
The nice thing is that in PowerShell, unlike bash, this flows through to the vast majority of other commands. If the script has the snippet above, then you don't have to litter it with "if ( $userSaidYes ) { ... }" blocks all over the place.
Similarly, PowerShell automatically wires up logic to produce all of the useful modes you might want:
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend
This is very fiddly to implement manually, and "Suspend" is likely impossible for most shells.
See: https://docs.microsoft.com/en-us/powershell/scripting/learn/...
# rm -rf some_dir
Then if you accidentally press return before completing it hasn't happened.
When you have reviewed and are sure it is correct, you recall and delete the hash to execute - simples!
I mean, I use the # hack sometimes too, but when I don't, I often find myself afraid of accidentally coming down on the Enter key.
"Run it as a query first" gets 90% of the way until you drop a constraint by accident whilst rewriting it as an update :o
alias harikb_script='harikb_script --do-it'
in their .bashrc to eliminate this annoying step.
$ run-script.sh --dry-run
`--dry-run` parameter not recognized
Executing ...
https://boingboing.net/2015/12/11/proposal-keep-the-nuclear-...
If you believe we should never use nuclear weapons, then don't have them at all.
If you believe there is a case where it may be moral and rational to use nuclear weapons, why would you want to put a potential barrier in the way of their use? You could have a situation where everyone was agreed to use them but the president was physically unable to harm the aide to use them.
You can know that something is the right thing to do but not have the courage to physically harm someone to do it.
An interlock that you may not be able to unlock for reasons unrelated to the task at hand is a bad interlock.
In this specific case the "thing to do" is literally to harm hundreds of thousands of people.
The reasoning behind this proposed interlock is that any logic which concludes that it is moral and rational to harm hundreds of thousands of people must also conclude that it is moral and rational to harm the "interlock" individual. Otherwise, it is likely that dropping the bomb would be a mistake.
Everybody agrees that this is a nuke-them-all situation, but the president, himself given part of the task of ripping apart human bodies, thinks more about the subject and decides another diplomatic round is a better option.
Because you think the point where they become moral and rational to use is way way way further than commonly discussed, and you want to put many barriers of many kinds (physical, emotional, logistical) to delay their point of use without completely blocking them.
You could also say that if a person is incapable of doing the hard parts of the job, don't vote them into the position. (Downside of that is that you'll end up voting someone who doesn't mind killing someone in cold blood while expecting that to be a filter that brings more empathy to the position).
Tell that to Russia. In the short amount of time only the USA had the bomb the USA bossed them all over with threats of using it.
It's an attempt to make an abstraction concrete. Think of it as the trolley problem in real life.
Stalin is famously supposed to have said, "one death is a tragedy, 100,000 is a statistic". Cynical or not it is how humans think.
> If you believe we should never use nuclear weapons, then don't have them at all.
Strategic game theory and Mutual Assured Destruction depend on the possibility that the other guy will use them if you do, and may be the only way to prevent their use. Interestingly this is one reason why you want the other guy to know your procedures, capabilities, deployments etc. Secret weapons have no deterrent value.
The Soviet General Secretary soon receives a report about what the new policy means tactically. Americans will take several extra minutes, possibly more, to authorize retaliation. (The exact delay is subject to disagreement. Secret experiments are conducted to get the timing down. They are inconclusive.) Amid the decade's mounting tensions, a preemptive nuclear strike looks more tempting than before.
Time is also of the essence for MAD; known delay only makes MAD less effective if e.g. sub-launched cruise missiles are faster than dissection. And do all the fallback commanders need their own willing victim to mount a response?
I guess that’s why they consider the idea here and not there.
I think that there is more to that. You need to consciously type the name of the repo that you want to remove. Windows used to add a lot of jumps to get something done, and the result was mindless clicking the "yes" button and realizing 1 second later that you deleted important information.
Those extra hoops need to be cognitively meaningful.
Same with Terms & Conditions. If you want your customers to truly have read and understood them, you have to show them a short quiz at the end of it. You're required to do a quiz in Europe nowadays if you want to engage in stock trading.
Half of me would want them to put `user-select: none` on that text. The other half has to archive 10+ repos and would hate that!
the entire Eastern Seaboard was without power?
https://youtu.be/XetplHcM7aQ?t=693 (James Burke's Connections, ref. cascading power cut 1965)
Basically, what happens is the brain switches operating context from "I want to do something" to "resolve this interruption (confirmation box)" and you don't relate the one to the other - you're so focused on getting rid of the interruption that the original task is forgotten until after the interruption is gone.
Then you switch back to the original task that had been interrupted by the confirmation box and then you realize you made a mistake.
It's much better to engineer "undo" ability into systems - like delaying commands (GMail's "Undo Send" does this), or caching previous state, etc.
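A minimal Python sketch of the "delay the command" flavor of undo, assuming something calls tick() periodically. DelayedAction and its methods are hypothetical names, and this is not a claim about how GMail implements it. The clock is injectable so the behavior can be checked without sleeping.

```python
import time


class DelayedAction:
    """Hold an action for `delay` seconds so it can be cancelled before it runs."""

    def __init__(self, action, delay, clock=time.monotonic):
        self._action = action
        self._clock = clock
        self._due = clock() + delay
        self.state = "pending"  # pending -> done | cancelled

    def undo(self):
        """Cancel if still pending; return True when the cancel took effect."""
        if self.state == "pending":
            self.state = "cancelled"
            return True
        return False

    def tick(self):
        """Run the action once the delay has elapsed; call this periodically."""
        if self.state == "pending" and self._clock() >= self._due:
            self.state = "done"
            self._action()
```

Once tick() has actually run the action, undo() returns False, which is the honest answer: the window has closed.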
Also, while undo is great, it's not always technically feasible. The tools in question are basically for modifying the layer that implements undo for your end users, and are often themselves fundamentally irreversible. Undo for raw hard disks involves forensic analysis at best.
No matter what tasks are required to make the dialog box go away - doing math, retyping a message, clicking a randomly ordered box - that becomes the top task in your head and you "forget" about the original task until you finish this task.
Once you resolve the interruption, you switch context back to the original task and then you still have that "oh crap" moment.
Yes, sometimes undo is very difficult, and can require a system designed to support that ability as a first-class feature from the start. In many systems you can perform rollbacks, but there are definitely destructive actions - in which case you should have test stacks to validate your actions in advance, and peer review. (e.g. dual keys to launch the missiles)
Continue: yes or no?
Don't continue: yes or no?
As long as operators know to expect this, they also know to wait and actually read the prompt before answering (as in turn off auto reaction)...
In aviation any time input is given to the machine, it's entered by one human (typically pilot flying) and then verified by the other human (typically pilot monitoring) before being committed to or executed. For example... when a new altitude is assigned by ATC, say FL300, the pilot flying will spin it in the selector window and keep his hand or finger there until the second pilot agrees with and confirms the selection by reading FL300 out of the selector window.
I know there are meat bags in these giant tubes so that changes attitudes towards safety etc. However, it seems to me that when organizations start putting the power to halt nearly the entire business in the hands of one person, there should be some slightly different attitudes. A breaking change in a million servers could easily cost hundreds of thousands or maybe even millions in lost revenue or employee productivity.
I'm just an outsider though. Perhaps this level of attention is practiced at some shops. It's just interesting to me how in some fields we settle on pretty uniform standard practices whereas others are seen as non-human-life threatening so it's just shoot first, ask questions later.
<me> team: hey, sanity check this please: hsh -A "dumb_thing && other_thing --foo --bar"
<teammate> shipit
[ I type the command ]
<me> ok, running as job 1234
The last part was a courtesy done so that they could watch the progress of it too without having to dig to find my request. It also meant they could kill it easily if something went wrong and they couldn't raise me for some reason.
Tools like this are best used outside the solo realm.
In many dysfunctional orgs, having someone to blame is desirable. They will use all kinds of words for it like "accountability".
But at the end of the day, heroes who take stupid risks that succeed get rewarded, cautious people who ask questions and try to understand before acting are smugly dismissed, and would-be heroes who burn the house down because of recklessness get blamed and make everyone else look good. It's all too common.
> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
Terraform will perform the following actions:
# google_compute_instance.vm_instance will be created
+ resource "google_compute_instance" "vm_instance" {
+ ... <more>
Plan: 2 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
[1]: https://manpages.debian.org/buster/molly-guard/molly-guard.8...
To all machines. Employee and servers alike.
Yes. Including the DNS servers.
Took them a day or two to work out how to roll that one back.
Except the change was to quarantine explorer.exe which was being changed with a patch that just got pushed out. The net result was about 6 hours of the desktop group wondering "why the hell are all of the PCs not logging in right after this patch" followed by about a month of rolling tickets from seldom used computers that had just been powered off.
His excuse was it only showed a file hash in the main screen and you had to view details to see the name plus he had a 3 day change open to roll out the system. Never understood how he got away with that one but such things did catch up to him about 2 years later.
1. Git's Force-with-lease. Git push's "force" is too powerful, you will likely regret this much power, but it's tempting. So force-with-lease is the same power but conditional on you telling git what exactly the state was that you're overriding.
This has two benefits, one is like Rachel's, it is an opportunity for a human to stop for a moment and consider, wait, why are we overriding this state? To find out what it is we might as well read... oh the state says it's an "emergency fix. Call Jerry". Maybe, just maybe, I ought to call Jerry before I force overwrite it?
But the other is about race conditions which Rachel doesn't specifically address. If you are very careful to check that the state you want to overwrite with force is indeed a state that should be overridden, nothing prevents it meanwhile changing and then you overwrote state you didn't even know existed. But force-with-lease fixes that because your lease won't match.
I believe Force-with-lease is a pattern that ought to be far more widespread. I've used several configuration management tools that let somebody say "Temporarily don't mess with config on these machines" and some of them let you write a reason like "James is rebuilding the RAID arrays" but none of them have that force-with-lease pattern that would let me say "I know James is rebuilding the RAID arrays, this change must happen anyway but if anything else is blocking the change then reject it and let me know".
2. Prefer Undo to Confirmation. If the computer can undo the action, even if that's a bunch of work and you'd rather not bother, put that work in and enable undo. Humans always know they "really" wanted to do the thing you're asking them to confirm so it's somewhat futile to ask, but they often realise they didn't want to afterwards and will undo it if you make that possible.
Not everything can be undone. Undo factory reset isn't a thing. But for lots of things you can't undo, it was just laziness; try to do better in your own software. Your users (which might include you) will be grateful.
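To make point 1 concrete, here is a rough Python sketch of force-with-lease applied to the "don't touch these machines" holds. It is an in-memory toy: HoldRegistry, LeaseMismatch and the method names are all hypothetical, and a real tool would need the compare and the change to be atomic on the server side.

```python
class LeaseMismatch(Exception):
    """The hold on the machine is not the one the operator said they were overriding."""


class HoldRegistry:
    """Per-machine 'don't touch' notes, overridable only by quoting the current note."""

    def __init__(self):
        self._holds = {}

    def place_hold(self, machine, reason):
        self._holds[machine] = reason

    def force_with_lease(self, machine, expected_reason, apply_change):
        """Apply the change only if the hold is exactly the one the caller expects."""
        current = self._holds.get(machine)
        if current != expected_reason:
            raise LeaseMismatch(
                f"{machine}: expected hold {expected_reason!r}, found {current!r}"
            )
        apply_change(machine)
```

If someone replaces the hold with "emergency fix. Call Jerry" between your check and your push, the override is rejected rather than silently applied.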
(I realize there is a possible error message case if the remote has changed... but I don’t feel like this command is the best one to use to discover whether the remote has changed, if you have no changes you actually intend to force push.)
Automation to the point of minimal human contact where you assume the human will read the presented information and make an informed decision doesn't work. The point is that we want a human to understand what is being asked, so taking some step to ensure they do understand is warranted. It will never be perfect, but adding steps like she proposes is definitely a step in the right direction, IMO.
* You could ask the operator to echo the qps figure...but really any number other than zero is likely to be an error, so it can just error out in that case without needing the confirmation.
* Even if it is serving zero qps now, if it's not explicitly drained at the load balancer, downing it is likely to be a mistake. So even better to check that.
Only once in my career have I taken down jobs serving live traffic. (They were serving 100% errors.) It was deliberate, but even so I wouldn't have minded having to supply a --yes-i-know-im-downing-live-jobs.
edit: and if for some reason my assumption is wrong and downing undrained things becomes routine...well, you'd want to fix that, but as a short term measure going back to confirming a number rather than the force option would be appropriate. It's certainly not good to have an override that's routinely used.
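A small Python sketch of the two checks above, assuming the tool can already look up current qps and drain status (how it does that is out of scope here). The function names are invented, and the force flag stands in for something like --yes-i-know-im-downing-live-jobs.

```python
def reasons_not_to_down(qps, drained):
    """List every reason downing this job looks like a mistake (empty means safe)."""
    problems = []
    if qps != 0:
        problems.append(f"still serving {qps} qps")
    if not drained:
        problems.append("not drained at the load balancer")
    return problems


def may_down(qps, drained, force=False):
    """Allow the action when there are no objections, or when explicitly forced."""
    return force or not reasons_not_to_down(qps, drained)
```

Returning the reasons separately lets the tool print exactly why it refused, rather than just asking "are you sure?".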
How many/which companies have more than one million Linux machines?
One million is a lot no matter how you slice it.
And yeah, physical machines, not VMs. Sometimes they're blades, sometimes they're sleds, but I mean real hardware made out of metal that you can pick up and use to defend the datacenter if you have to.
(Although, honestly, I was talking about global counts in the million+ range when I wrote it since it was referencing the past, but by now, a region with a million+ is not far-fetched.)
I have to type "danger" to bypass this restriction, and I thought it was pretty cool.
Another good UI pattern is in Firefox, that it disables the Run button on downloads for a few seconds.
I'd select the dupes for merge using a checkbox, but the vendor's interface for this just had a "confirm" button. So, I confirmed. However I'd selected the "select all" box and.... confirmed. Merging every. single. record. into one (1) record.
I was fortunate, the vendor was able to roll back the changes, and nothing was lost. I also had a very good mentor-like boss who avoided reaming me out before we knew if there was a solution or not, and when there was he simply told me "I'm sure you've learned your lesson, but don't do that again."
> "This might be as simple as printing the number with your locale's version of numerical separators, like "123,456" or "123.456" or "123 456" or whatever else you might use where you are. The trick is then to NOT accept that as input, but instead demand that they remove the separator and jam it in as just digits. "
It's easier to just strip non-digit characters than to parse the input for them and respond accordingly. This is a confirmation step with basically a checksum, so you're not going to get many false positives.
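For comparison, a Python sketch of both variants, with made-up function names: the strict check from the quoted article accepts only the bare digits, while the lenient one strips every non-digit before comparing.

```python
import re


def confirm_strict(typed, count):
    """Quoted article's variant: accept only the bare digits, e.g. '123456'."""
    return typed == str(count)


def confirm_lenient(typed, count):
    """Stripping variant: drop every non-digit character, then compare."""
    return re.sub(r"\D", "", typed) == str(count)
```

One difference worth noticing: the lenient version accepts a straight copy-paste of the displayed "123,456", whereas the strict version forces the operator to retype or edit it.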
Or to all the machines, on one occasion.
(It was actually some sort of race condition when we massively updated per-project access permissions and asked for SSH keys to be redeployed, but it was annoying as heck, and sure to happen whenever you really needed to access that particular machine.)
I'm sure it would also wake me up from autopilot. But I don't do this often so I can't really know. It seems like this is good enough for many people, who don't perform this action too often.
It hasn’t been an issue for me, since repo names aren’t usually super long and onerous to type.
After hitting the wrong number once, I added a confirmation that presented a random six-digit number that you had to enter before it accepted the command.
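A minimal Python sketch of that kind of challenge, assuming the surrounding tool handles the prompting. The function names are hypothetical; secrets.randbelow is only used as a convenient source of a random number.

```python
import secrets


def new_challenge():
    """Return a random six-digit string, zero-padded (e.g. '004217')."""
    return f"{secrets.randbelow(1_000_000):06d}"


def challenge_passed(challenge, typed):
    """Accept only an exact match, ignoring surrounding whitespace such as a newline."""
    return typed.strip() == challenge
```

Because the number changes every time, muscle memory can't answer the prompt for you.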
And suddenly my complex analysis prof who wrote his exams in Comic Sans is vindicated!
Thinking I can probably enhance it by forcing the user to type in the number as text rather than numeric, so they can't cut-n-paste. Kind of force them to type in "I am sure I want all data ever" or something.
Edit: added name of software
The result of one too many mindlessly accidental pushes.
A few places!? What is an example of this?
Looks like https vs http link.
Then you search the logs to see who is trying the command with the esoteric flag and "fix the glitch with payroll" for those employees.
If the tool could potentially touch a large number of machines, even if you’re super sure you got it right you should still prompt the user.
"Do you care? (Y/N)"
Cattle, people. Not pets. Just make sure you don't hit all machines simultaneously and are rolling, instead.
Since the post is talking about automation anyway, assume that any machine that can go down will go down. Ensure that any such disruption will be minimal. Oops, you just killed the production database? Whatever, who cares, it has just failed over anyway (or, for a distributed one, a new node was elected, data started replicating, etc).
If one considers having to SSH to a machine to be an anti-pattern, it's amazing how much crap goes away.
In the more generalized case, where it's not about machines, then it makes more sense. Maybe you are running a query that's going to perform updates across multiple clusters. It still should not be done by hand with direct production access - unless you are in the middle of a declared (and urgent!) incident and everything is on fire. In which case there's a bunch of people watching over your shoulder (or more likely, screen sharing in a conference call).
The same job you have (hopefully) run in QA you should be able to re-target to production. Make the question just be a way to "unlock" your automation - for instance, by not copying credentials or environment information until the proper confirmation has been received. One should still have an escape hatch for when (not IF) things go wrong.