Type in the exact number of machines to proceed

(rachelbythebay.com)

554 points | vii | 5y ago | 332 comments

I've seen this called "pointing and calling" [1], Japan's train drivers use the technique to force themselves to perform actions and take notice of the current environment.

I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.

[1] https://en.wikipedia.org/wiki/Pointing_and_calling

brundolf 5y ago

I've only been in the job field for six years, and yet:

My first boss accidentally deleted our QA database, meaning to delete a local copy

A later boss accidentally deleted our production database, thinking it was the clone that he had just made (which luckily we still had)

Both of them were very experienced developers in their 40s. Nobody is beyond this kind of mistake.

jlmorton 5y ago

War story time. Long ago, I worked for an interesting company that insisted on running its entire business on Linux desktops, all the way back between 1999-2002. Imagine running StarOffice/OpenOffice, Thunderbird, Netscape Navigator, etc, for your entire business back in 2000, including your executive team, marketing teams, everyone, most of whom had never even heard of Linux before.

Anyway, this being Linux, everyone's home directory was mounted on NFS. All our builds were standardized with a tool called SystemImager, which we could use to push out updates to everyone's desktop whenever we wanted. If there was a new version of KDE, we could pretty easily push that change out.

Sometimes it was convenient for me to work on updates to these images by chrooting into a directory containing the "image," which was really just an rsync tree. And sometimes, when updating these images, it was convenient to mount our NFS home directories in this chroot environment, so I could access things like an archive I had just downloaded on my own desktop.

And eventually we had lots of different images, and the old ones were using up a lot of disk space, so I decided to clean up some space by removing the old images. And these are fairly large images, with lots of small files, and this was before SSDs were a thing, so it made sense that deleting them was taking a while, and I stepped out to grab something to eat.

As I was eating lunch, I started getting the tech support escalations. But this wasn't that unusual, our users routinely had problems with the environment we had provided. They hated it, because it was in many ways terrible, and they made sure we knew it. So I wasn't terribly alarmed. I didn't think any major changes had been made, so I didn't hurry back.

By the time I leisurely returned from lunch, half the NFS home directories for our users were gone, along with all their documents, emails, bookmarks, or whatever else. Suddenly it hit me what had happened: at some point, perhaps months earlier, I had left our NFS home directories mounted within one of these image chroots. And now I had sudo rm -rf'd it.

We had backups, but they were on tape, and it took several days to restore, with about a day of data loss.

abrookewood 5y ago

That sinking feeling and cold panic when you realise what you've done. God that is horrible.

dasyatidprime 5y ago

You probably knew this already, and there's probably better solutions if you're not in the manual sysadmin world, but after I did that on a personal machine a few decades ago (I think it was?), I got in the habit of using `--one-file-system` when doing major recursive rm operations that weren't meant to cross filesystems. Or `find -xdev … -delete` for anything more selective.
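
For anyone who wants the same guard inside a script rather than on the `rm` command line, here is a minimal Python sketch of the idea, not a replacement for `rm --one-file-system`. The function name `remove_tree_one_filesystem` is made up for illustration. It compares device numbers (`st_dev`), so it assumes the thing you want to protect is a separate mount; a bind mount from the same device would look identical to this sketch and would not be skipped.

```python
import os
import stat


def remove_tree_one_filesystem(root: str) -> list[str]:
    """Recursively delete `root`, refusing to descend into other filesystems.

    Returns the directories that were skipped and left intact.
    """
    root_dev = os.lstat(root).st_dev
    skipped: list[str] = []

    def _remove(path: str) -> bool:
        """Delete `path`; return True only if it was removed completely."""
        st = os.lstat(path)  # lstat: never follow symlinks
        if not stat.S_ISDIR(st.st_mode):
            os.unlink(path)  # regular files and symlinks are simply unlinked
            return True
        if st.st_dev != root_dev:
            skipped.append(path)  # a mount point: leave it and its contents alone
            return False
        emptied = True
        for name in os.listdir(path):
            emptied = _remove(os.path.join(path, name)) and emptied
        if emptied:
            os.rmdir(path)
        return emptied

    _remove(root)
    return skipped
```

It returns the directories it refused to descend into, so an unexpected entry in that list is a cue to stop and look before doing anything else.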

coredog64 5y ago

Similar story, except we were using an NFS appliance that took hourly snapshots. As soon as we figured out what was happening, we had the storage team save off the latest snapshot. It was 1TB of data (a lot for the time) and took a week for us to restore.

aprdm 5y ago

A lot of companies still work in a similar fashion to what you described, maybe with root squashed, but still, it's very possible to have something like that happen nowadays!

I remember someone hit a bug with docker exec --rm years ago where it started deleting some NFS files that it shouldn't...

Huggernaut 5y ago

This reminds me of a time when a colleague and I were investigating some persistent D-State processes that were occurring when container processes were being exec-ed.

Once on the box, we wanted to create a container with utilities in the fs but didn't want to download an image tarball or look through the rootfs layer directories for one to use, so we just bind mounted host root onto another directory, beside the config file we were using.

This worked like a charm. Until we rm -rf'd the config directory and deleted host root in the process.

In our case, fortunately the consequences were minimal as all workloads were stateless. The container scheduler moved all the workloads to other hosts and the host scheduler noticed this VM wasn't responding any more and rolled a new one. The whole thing resolved itself in about 5 minutes with no interaction from us - so that was pretty neat.

stmw 5y ago

That's a very sad war story, hope it turned out OK. Sorry you and the users had to go through that.

eljimmy 5y ago

Oh man - this one is anxiety inducing. I feel like this would haunt me for years.

brlewis 5y ago

>very experienced developers in their 40s

I'd say they were experienced developers. Only after accidentally deleting databases were they very experienced developers.

PopeDotNinja 5y ago

I once cloned a directory for standing up an environment via Terraform. I modified all of the environment variables and config and ran it. It worked perfectly. Except I’d forgotten to wipe out the Terraform state, which meant that in the process of creating a new environment, it completely deleted the environment I had cloned. That was my initiation into very experienced :)

BalinKing 5y ago

This may not have been their first time, though :-P

mehrdadn 5y ago

Reminds me of when I accidentally deleted a virtual hard disk I had a few years ago, because I'd copied it earlier and I thought I still had the other copy left. Only afterward did I remember I'd done the exact same thing to the other copy earlier... thankfully the information on it wasn't critical, but it was kind of terrifying to realize it very well could have been.

sgustard 5y ago

I have been that boss. Is that you, Wendel? In any case: the deletion even had a "type your app name to confirm" prompt, but I knew I wanted to act on production; the issue was deleting the wrong one of multiple production databases. The takeaway was to grab a second pair of eyes to review any dangerous operations.

aidenn0 5y ago

I deleted our production CRM database meaning to delete the test database. While my boss was running queries on the database for setting my quarterly bonus.

Good news is that I was deleting the test database to ensure that the recovery from backups was properly automated, so it wasn't down too long.

davedx 5y ago

Yup. Senior dev here, my own devops config screw up wiped out all production sales order data earlier this year. Had to restore from multiple backups, took a while. Stressful experience.

Consider network partitioning so dev/test/accept just has 0 contact with prod.

contravariant 5y ago

Ironically there seems to be no time more prone to these kinds of mistakes than when you're trying to prevent or fix them.

jrott 5y ago

Most of the worst production issues I've been involved with have come from trying to fix a minor issue and then somebody making a mistake. The way our brains are wired to handle stress isn't really useful for debugging complicated problems.

myself248 5y ago

Ever since hearing about point-and-call, I've started using it in the kitchen when turning on the stove. I used to destroy one or two pans a year by turning on the wrong burner, but it's now been about a year and a half and I haven't screwed it up yet.

The knobs are labeled with a terrible little glyph meant to indicate which is which, and I've supplemented this with plain-english Brady labels "front left", "front right", etc. Now I speak the words above the knob, and point to the burner. It felt goofy at first, but now it feels normal, and like I'm tempting fate if I skip it.

giantDinosaur 5y ago

I'm curious how exactly you managed to destroy pans. I've never destroyed a pan in my life, and take no particular precautions - is this a common thing? Is this more common with non-stick stuff or something?

ajb 5y ago

Not the op, but non stick pans will burn if the pan is heated while empty.

myself248 5y ago

The non-stick ones especially, but even plain metal pans will warp if they get hot enough. And then they don't sit flat on the burner, which might not matter on a gas stove, but contact with an electric burner is pretty important.

dirkt 5y ago

Not sure how it is in other countries, but don't the knobs when going left-to-right always correspond clockwise to the burners, starting at the lower left? And the oven knob is to the right?

I've never seen a different arrangement.

ajanuary 5y ago

They differ a lot. The first two results on Google image search for me show anti-clockwise from far left [1] and clockwise from front-left [2].

[1] https://www.blomberguk.com/appliances/integrated-appliances/...

[2] https://www.ikea.com/gb/en/p/smakoka-gas-hob-stainless-steel...

dmurray 5y ago

My four knobs go front to back. I don't know what order they're in - the glyphs are fairly readable to me. I've seen this arrangement plenty, it's not unique.

MaxBarraclough 5y ago

Worth mentioning that, assuming the single study on the matter can be believed, the pointing and calling method is extremely effective in reducing the incidence of silly mistakes (that is, mistakes made in simple routine tasks, by competent individuals).

Unfortunately, it strikes many as looking rather silly, so it hasn't been widely adopted.

js2 5y ago

I learned a technique from a gray beard[0] when I worked as a student sys admin for the CS dept over two decades ago. Whenever typing a destructive command, he'd take his hands off the keyboard and drop them to his side, re-read the command, then put his hands back to press enter.

I do this whenever I'm on a production server (which is rare anyway). I use different colored prompts for local and remote shells.

[0] Technically he had no beard and if he had, it wouldn't have been gray.

encom 5y ago

Re: Beards, color of:

Mine started turning grey in my mid 20s.

Could be related to me doing the electrician's equivalent of deleting production DBs. I've drilled through the comms cable to payment terminals during opening hours. I've run over a copper gas line with a scissor lift. And yes, I've cut live 230V cables with hand tools.

That sinking feeling in your stomach you get immediately after doing something bad - it's universal across professions.

Thankfully, I've never fucked anything major up, and I've had my hands in hospitals, power plants, ISP fiber backbones, police stations and whatnot.

brundolf 5y ago

Different-colored prompts for different machines is a great thing to do (I've been doing it for years), and very easy to implement

morelisp 5y ago

A similar tip I picked up long ago: If you're typing a dangerous command, first type a `#` (or `--` if it's SQL, etc.), then the command. Then read it. Then go back to the start of the line and remove the comment and run it.

morelisp 5y ago

I've done this for several years (also after seeing a video about Japanese railway operations). It doesn't seem to catch on.

It's also not perfect; it does not catch mistakes concerning "non-local" state, e.g. configuration files in /etc merging with one in . merging with some command line options. (Personally I try to avoid writing tools with defaults of this sort, but especially Java developers seem to have different opinions.)

Unfortunately if you do P&C and still make the mistake due to the aforementioned tooling, you look even stupider.

myself248 5y ago

Around industrial machines, I've long held and promoted the view that the machine is _trying_ to kill you, _trying_ to damage itself, _trying_ to ruin the workpiece. Only by outsmarting it at every turn, and having safeguards against every mishap, can you go home at the end of the day.

When something happens despite all that, just step back and realize how much worse it could've been, and how successful your safeguards have been up 'til that point.

Then look carefully at the procedure. Is there something about the naming or structure that could be more clear? Can you think of near-misses that resemble the failure you just experienced? Are you using boobytraps in production? Symlinks and overlay filesystems seem clever in the moment but they're bound to subvert our intuition someday. Perhaps you should get in the habit of always using full absolute paths, for instance.

There's always another gotcha, but if your workflow doesn't look as over-the-top safety-silly as aerospace, you're not doing as much as you could be. (Hint: It's not silly.)

blantonl 5y ago

Watch and listen to pilots as they complete checklists. They point and call out each item, switch setting, etc.

waterhouse 5y ago

I searched Youtube for examples of this. This is a little bit staged, but it seems to be a real checklist they're going through: https://www.youtube.com/watch?v=JG7SkOQDDt0

Though they're not perfect. They said that one pilot is supposed to read the item, the other pilot say the answer, and the first pilot visually confirm it; but at 1:42, I noticed the first pilot say "emergency exit lights", hear the confirmation, and move to the next item without her eyes moving away from the list.

I'm not sure which of several possible conclusions to draw from that. ("Humans suck", "it is indeed staged", "the procedure has enough redundancy that the chance they're both careless on a given step is small", "the pilots feel that the emergency exit lights aren't particularly important", ...)

morty_s 5y ago

Came here for this.

A: “Passing control”

B: “Taking control”

A: “You have control”

B: “I have control”

This is how I remember it (6174, UH-1Y).

staunch 5y ago

And pilots will even call out that their action had the desired effect:

"Flaps up selected"

"Flaps are indicating up"

There's a lot to learn from the way airplanes are engineered and operated.

acdha 5y ago

Back when I shelled into servers more, I really liked having my deployment put the environment in the prompt and set a red background on production for similar reasons. It only takes a small change to jar you out of habit.
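
A rough Python sketch of that idea, for tools that print their own prompt or banner. The environment names, the colours, and the `prompt_for` helper are assumptions for illustration, not anything taken from the deployment described above.

```python
# ANSI SGR escape codes: 41 = red background, 43 = yellow background,
# 97 = bright white text, 30 = black text, 0 = reset.
STYLES = {
    "production": "\033[41;97m",
    "staging": "\033[43;30m",
}
RESET = "\033[0m"


def prompt_for(environment: str, host: str) -> str:
    """Build a prompt that names the environment and colours the risky ones."""
    style = STYLES.get(environment)
    if style is None:
        # Environments without a style get a plain, uncoloured label.
        return f"[{environment}] {host}$ "
    return f"{style}[{environment.upper()}]{RESET} {host}$ "


print(prompt_for("production", "db1"))
print(prompt_for("dev", "laptop"))
```

If the string were used as a bash `PS1`, the escape sequences would also need wrapping in `\[` and `\]` so that line editing measures the prompt width correctly.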

YeGoblynQueenne 5y ago

>> I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.

Yeah, ouch. More ouch if it's the other way around- you delete the test database and it's not the test database.

(long story)

kbenson 5y ago

> you delete the test database and it's not the test database.

> (long story)

I think you can skip the long story, as most of us can tell a story similar in theme if not specifics (and sometimes, probably some similar specifics too). ;)

With great power comes great responsibility (to not completely screw stuff up because you were on autopilot for a second...)

YeGoblynQueenne 5y ago

Indeed. God damn muscle memory.

throwaway894345 5y ago

I worked at a company where someone deleted the production database by accident and the snapshot mechanism hadn't been working AND the alerting for the snapshot mechanism was also broken. Fortunately someone had taken a snapshot manually some weeks prior and they were able to restore from that and lose relatively little data (it was a startup, so one database was a big deal, but weeks worth of data was not such a big deal).

txutxu 5y ago

I worked at a company where someone deleted the production RDS and all the snapshots.

He typed the confirmation and asked for the snapshots to be deleted too.

He had two browsers open, one for development (of CloudFormation, etc.)... but someone asked him to change a thing in prod.

Both browsers looked identical. Only the account in the top right corner differed.

Both CloudFormation stacks were identical (instance names, etc.).

He had spent the whole morning launching and deleting the dev environment.

Teammates were joking loudly around his desk just before it happened.

Sadly, he got fired (the company was proud of its cost-savvy choices and had no backups other than a few days of snapshots, probably the CTO's choice).

dheera 5y ago

> I've seen this called "pointing and calling" [1], Japan's train drivers use the technique to force themselves to perform actions and take notice of the current environment.

The concept makes sense, though I don't quite fully get how to translate it to other contexts besides train driving where unexpected and unpredictable events come up all the time. Let's say you're driving a car and the traffic light turns red. Do you point at the traffic light, say "red", point at your brake pedal, say "brakes", and then hit the brakes?

apozem 5y ago

In high school, I drove a 1993 Toyota Tercel. It was a functional, reliable car, but it had no keyfob to lock the doors remotely.

Getting out of your car, pressing the lock button on the inside of the driver's side door, and shutting the door are all routine, boring actions that make it easy to forget your keys inside the car. The keys can go in all kinds of places as you climb out of the car - jacket pocket, pants pocket, center console. It is very easy to lock your keys in your car.

I quickly learned to hold my keys in one hand, say out loud, "Keys in hand," and then lock the door with the other hand.

This technique is perfect for any repetitive action that could go wrong with non-trivial consequences, and there's lots of that in everyday life.

wruza 5y ago

I'm always using the "phone keys cigarettes money" mantra together with patting on my pockets before opening any outside door.

scott_s 5y ago

That’s me approaching a blue mailbox with my letter to send in one hand, and my keys in the other.

edgyquant 5y ago

I just put a spare behind the license plate

kube-system 5y ago

Repetitive tasks are exactly what pointing and calling helps with. The intent is to prevent the brain from going on autopilot for a task that happens exactly the same way 99.9% of the time, in order to prevent disasters that last 0.1% of the time.

Traffic lights are a lot more random (and therefore mentally engaging) than the types of things train conductors are pointing and calling.

An automotive equivalent of a situation that would benefit from pointing and calling is something like this: https://www.consumerreports.org/car-safety/guide-to-rear-sea...

e.g.: "Car parked, ignition off, get child"

Timpy 5y ago

Whenever I have something in my hand that I'm about to put down for a second in the exact absent minded kind of way that would leave me searching all over the house for it 5 minutes later, I say it out loud. "Headphones on the table by front door."

roland35 5y ago

Embarrassingly I once lost a hamburger while still holding it... I had my arm propped up on the back of the chair and it was just out of my peripheral vision. Not my smartest moment.

uranusjr 5y ago

I believe the trick is to anticipate failure, and call out the normal thing instead. So you’d always slow down at every light, and only speed back up after calling out green. This is what all drivers are actually supposed to do, although I fully realise nobody practically does that, which is why we get so many automobile accidents all the time.

toast0 5y ago

Only speed back up after calling out green and intersection clear.

I don't necessarily always do that, and don't make audible calls, but when driving at night or in inclement weather, I try to make extra effort to check for unexpected cross traffic.

nemetroid 5y ago

The pointing and calling performed by Japanese train drivers is very much about expected events. "Green signal" would be one of the most common call-outs. For example:

https://www.youtube.com/watch?v=afjPmN0GT04

Green signals are pointed at at 2:58 and 3:29.

bo1024 5y ago

Your example is a reactive event. Something happened in your environment.

This idea is more useful for situations that you are initiating, and where feedback is not immediately obvious.

An example could be turning your car’s lights on at night. Before starting the car, you force yourself to point to the switch, say “lights on”, and do it.

I use this with keys. When leaving my office, house, or car, I hold up the key in my hand and establish sight (I don’t say anything out loud). Then I lock the door.

notJim 5y ago

I'm a photographer, and I used to get annoyed that I'd have little distractions on the edges and corners of the frame, because I was focussed on the subject and overall composition. I trained myself to sort of bounce my eyes around the sides of the viewfinder when pressing the shutter (think like the DVD player menu). Now I almost never forget to check.

leetcrew 5y ago

I don't think it really applies to stuff like driving, which almost has to be muscle memory to work at all. even with something routine and non-urgent like switching gears in a manual, the steps have to happen faster than you can say what you're doing.

a good example from normal life is (physical) key management. I used to always forget my keys when walking out the front door, which was a big problem since it locks automatically. to solve the problem, I made my back right pocket be the designated "key pocket". I now slap my right butt cheek whenever I leave a building. it might look weird to observers, but I have not once forgotten my keys since I implemented this system.

cecilpl2 5y ago

After losing my wallet several times and not having a clue when the last time I had it on me was, I implemented a similar system. I now habitually triple tap my three designated pockets for phone, wallet, keys, every time I walk through a doorway.

That way, if any of them are missing, I know they must be in the room I just left.

cortesoft 5y ago

I do a "wallet keys phone" mantra when I leave a building.... has a bit of a melody to it that I always repeat

tsomctl 5y ago

I do that too. The important thing is to pat your pocket before closing the door. Twice now I've done it 2 seconds too late.

SkyBelow 5y ago

Invert it and I think it works. Always prepare to stop at an intersection. Then point out it is green and call out you do not need to engage in stopping.

It may seem silly, but if we asked people who drive 30+ minutes every day if they have ever accidentally run a stop sign or red light, I suspect the numbers would be quite high (though these likely happen at times/places where the chance of an accident is smallest, such as empty roads late at night).

shezi 5y ago

I teach my children to point in the direction of where cars can come from before crossing the road. He used to just swing his head around before, now he has to search directions and point there to direct his attention and it works excellently.

As others have pointed out, this is for repetitive tasks that your brain wants to automate away, but you really want to keep in attention.

hrktb 5y ago

It can be used for exactly the same purpose: checking the environment before doing the action.

E.g. force yourself to read the “production” part of your prompt before running the command. Point at the user name before deleting its record. Read aloud the version name before sending it to deploy.

It really makes a difference between just glancing at the info, and having to parse it as part of an action.

jrumbut 5y ago

Let's say you get a request to delete users #s 1, 17, 152, and 43.

Now you can have the request and database administration tool open and point and call at the numbers and any queries and make sure you are deleting the right users.

saberdancer 5y ago

OpenShift does this by forcing you to write the name of the project you are about to delete. It was something that used to annoy me but reading this I understand it is a good call from their side.

rachelbythebay 5y ago

I do that when I drive around. Car on the side street. Kid over there... with a ball. Hidden left turner in 3...2...1... yep.

I love finding out that this stuff works.

nailer 5y ago

I do things like

  const HARD_CODE_TEST_DATABASE_FOR_SAFETY = 'unit-testing'

  destroyDatabase(HARD_CODE_TEST_DATABASE_FOR_SAFETY)

1. Avoid silly terms our industry should have ditched years ago, like 'drop'

2. Making sure that nobody will ever change HARD_CODE_TEST_DATABASE_FOR_SAFETY because they thought it should 'always be the active database' or whatever.

justinlloyd 5y ago

I have had many disasters in my software career because I just wantonly hit "Y" without thinking about it.

I have noticed, since learning to cook at a professional level in the kitchen, that I point and call out a lot more in my other activities too. From "hot behind" and "knife" and "oven is over temp" to "Saw blade is live" and "circuit is live" in the workshop to "production server" and "erasing records" in database maintenance. Some days I feel like Sigourney "I have one job damnit" Weaver in Galaxy Quest. It's a useful stop-think-go sanity check.

uyt 5y ago

This is true for NYC subways too! https://www.youtube.com/watch?v=i9jIsxQNz0M

greenyoda 5y ago

The video doesn't really explain why conductors point at the signs - it just says "to prove they're paying attention". Paying attention to what? The answer is that they are verifying that the train is correctly positioned in the station so that all of the doors will open on the platform.

Explained here: https://www.nydailynews.com/new-york/mta-conductors-point-st...

tialaramex 5y ago

This comes up every few weeks on HN but nobody has ever offered any statistics that would suggest this is as good let alone better than just having the trains handle alignment automatically. It's a task humans are bad at and machines are good at, so just giving it to machines makes more sense, modulo unions.

London Underground hasn't had guards for decades at this point, and the Docklands Light Railway hasn't even had drivers (there is a member of staff who is trained to be able to drive it on every train, but they are usually doing other things) since its creation. If they're misaligning often enough for it to be possible for New York to be statistically better I haven't seen anything about it after repeatedly asking.

viraptor 5y ago

I try to do that during incidents. I'm not 100% there since it's not a company rule, but it helps me at the time and later when writing up details: "I see <behaviour X>", "<Y> should fix it because <Z>", "I'm starting to do <Z> now and seeing ...", etc.

It also helps when Z results in a total meltdown and you need to pull in more people to help out, so they have context of what happened.

Qu3tzal 5y ago

French firefighters do this when arriving at a scene. The first messages sent over the radio will say:

- I am... (who you are and where you are)

- I see... (describe what you see in simple non-ambiguous terms)

- I do... (what action you are taking now)

- I ask... (ask for reinforcements if necessary, you may be asked to justify yourself more)

xvf22 5y ago

Killed just under 1k access points when they all upgraded in one go. They had no problem erasing the firmware, but when they all tried to download the new one at once it killed the service and we ended up with a lot of blank APs. The confirmation message for 1 or 1000 APs is unhelpfully "This will overwrite all existing system images. Are you sure Y/N"
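
The title of the linked post suggests the fix: put the count in the prompt and make the operator type it back. A minimal Python sketch of that, with hypothetical helper names and a made-up list of access points:

```python
def confirmation_prompt(action: str, targets: list[str]) -> str:
    """Spell out how many things are affected and ask for that number back."""
    count = len(targets)
    return f"This will {action} on {count} device(s). Type {count} to proceed: "


def is_confirmed(reply: str, targets: list[str]) -> bool:
    """Accept only the exact count; "y", an empty reply, or a wrong number refuse."""
    return reply.strip() == str(len(targets))


access_points = [f"ap-{i:04d}" for i in range(1000)]  # hypothetical inventory
print(confirmation_prompt("overwrite all system images", access_points))
print(is_confirmed("y", access_points))     # False
print(is_confirmed("1000", access_points))  # True
```

Someone who meant to upgrade one AP and sees "Type 1000 to proceed" gets a chance to notice the mismatch before anything is erased.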

m463 5y ago

> forcing a cache miss in the brain

That is an interesting way of looking at it.

I think a router analogy might be more precise - more like fast path / slow path - where when most packets come in they hit the fast path in hardware, and slow path exception packets get handled by the cpu.

ekanes 5y ago

I do this with my kids, gesturing (not pointing) as it helps my mind remain focused on truly listening to them amid everything else going on. I probably look ridiculous, but I'm a better father for it so ¯\_(ツ)_/¯

stjohnswarts 5y ago

I always called it a "that can't be right" interrogative.

xamuel 5y ago

I wish it were possible for similar prompts to appear before all sorts of policy-makers and bureaucrats. "It appears you are about to institute a policy which will require 400 million patients to sign an additional waiver every time they visit a clinic, this will waste a total of 354,921 human hours within the next year alone. Please type 354,921 to proceed."

gumby 5y ago

The motivations are different: the cost to the rule maker of the effort by all those people is nil. While the cost of not adding the paper is the risk of something happening in the future which could cost them their job. This is why the shoe removal theatre was added to flying: the risk of something happening is essentially nil, but if it did, heads would roll.

This is not a criticism of bureaucracy or regulation BTW (I'm a fan of both, in general). It's simply a recognition that there's a misalignment of objectives.

Not sure how to analyze the calculus in the case of rachelbythebay's observation. Certainly there is one misalignment which is if the tool has sharp unprotected edges (e.g. can take the company's whole site down) the person who ran the program will be blamed, not the person who wrote it. Unless they are the same person, it's hard to get a proper feedback loop in place. The only tools we have are coding standards and code reviews: bureaucracy!

cortesoft 5y ago

In my experience, the protections are added after a Learning Review from an incident.

Joker_vD 5y ago

Yeah, it's quite surreal. "Hey, privacy is important, so let's make it so that to handle people's private data, you'll need permission from them". All right, now whenever you try to e.g. send a (paper) mail, you have to sign the waiver that yes, you do allow the post office to see and handle your name and your mail address. Not only that, all such waivers seem to be written as "I hereby allow <insert the legal entity> to handle my private data in whatever way they want to", so we're back at square one, just with more perfunctory paperwork required.

jackhack 5y ago

closely related: the Paperwork Reduction Act of 1995

https://digital.gov/resources/paperwork-reduction-act-44-u-s...

it requires the Office of Management and Budget to calculate the impact of record-keeping requirements on time and privacy, among other things.

I do not believe it has resulted in a reduced record-keeping burden. For the most part I simply see an estimate of how long it will take to complete my tax forms and permits, on the form itself. Perhaps others have different views.

mulmen 5y ago

Hard to say, knowing the cost of a new process could have informed a new design or requirements. We don’t know what the other path held. But I believe in general having more information allows us to make better decisions so this is a good act.

mulmen 5y ago

How do you know it was a waste? Maybe that was time well spent.

harikb 5y ago

I have a habit of creating cli tools, which potentially do dangerous things, to default to dry-run mode. For example, instead of the typical `--dry-run` or `-n` option, my scripts instead had a cheesy `--do-it` to be non-dry-run. It is annoying as hell to my colleagues, but saved the day many times.
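
A minimal Python sketch of that inverted default, using `argparse`. The script's purpose, the `main` function, and the injectable `delete` parameter are illustrative assumptions; only the `--do-it` flag name comes from the comment above.

```python
import argparse
import shutil
from typing import Callable, Sequence


def main(argv: Sequence[str], delete: Callable[[str], object] = shutil.rmtree) -> int:
    parser = argparse.ArgumentParser(description="Remove old image directories.")
    parser.add_argument("paths", nargs="+", help="directories to remove")
    # There is no --dry-run flag: a dry run is what you get unless you opt in.
    parser.add_argument("--do-it", action="store_true",
                        help="really delete (default: only print the plan)")
    args = parser.parse_args(argv)

    for path in args.paths:
        if args.do_it:
            delete(path)
            print(f"deleted {path}")
        else:
            print(f"[dry run] would delete {path}")
    return 0
```

Called as `main(["old-image-1"])` it only prints the plan; `main(["--do-it", "old-image-1"])` is the version that deletes. Passing `delete` in as a parameter is just a way to exercise the flag logic without touching the disk.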

PureParadigm 5y ago

A coworker of mine would write all his bash scripts to echo out the commands it would run, and then to actually run it he would pipe it to bash. This way he could inspect the commands to make sure they were correct before running them.

Something like: ./dangerous-script.sh $args | bash

GauntletWizard5y ago

I would love a shell that allows you to “run” a script in manual mode, where before every command, every statement, it prints what the next command will be with all variables expanded or otherwise called out, and then requires you to hit “enter” to proceed. I write a decent amount of something between a README and a shell script. I’ve already got an awk one-liner that parses the shell out of Markdown. I typically copy+paste, line-by-line, from my README and add a bunch of echo statements to verify what I’m doing.

efreak5y ago

Press f8 to process autoexec?

tomjakubowski5y ago

Is your coworker Willard Van Orman Quine?

dredmorbius5y ago

Same, or save to a file, temporarily, check that, then run the resulting script.

meesterdude5y ago

wow that's so clever and simple! Love it.

jacobwilliamroy5y ago

I also do this.

jiggawatts5y ago

In PowerShell, this is a native feature of the entire shell and hence scripts and commands.

The following prefix in a ps1 script enables the -WhatIf and -Confirm parameters:

    [CmdletBinding(SupportsShouldProcess=$true)]

To enable -Confirm by default for scary scripts, just use:

    [CmdletBinding(SupportsShouldProcess=$true,ConfirmImpact='High')]

The nice thing is that in PowerShell, unlike bash, this flows through to the vast majority of other commands. If the script has the snippet above, then you don't have to litter it with "if ( $userSaidYes ) { ... }" blocks all over the place.

Similarly, PowerShell automatically wires up logic to produce all of the useful modes you might want:

    [Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend

This is very fiddly to implement manually, and "Suspend" is likely impossible for most shells.

See: https://docs.microsoft.com/en-us/powershell/scripting/learn/...

yobert5y ago

I did this with "--im-not-scared" for production mode :D

csours5y ago

We had one that required "BADIDEA" to run

dmuth5y ago

I do something similar with my scripts, but have `--go` action, even on a script that requires no other options, just so that if it's run without any options, the person running it gets a message saying what the script WOULD do, if `--go` were passed in.

hotsauceror5y ago

I do the same thing. All of my scripts have a -defang parameter which walks through the entire process, including placeholder log messages, but not actually performing the operation. My run books always say to run your exact command with this switch first, to proofread it. For some dangerous scripts, defang is enabled and has to be manually turned off. Defang is also nice because it will tell you e.g. here’s the size of the backup you’ll be restoring, or the filepath you’ve composed based on your parameters, or confirming that you’ll be replacing an existing thing instead of creating a new one. It has saved me many, many times.

robaato5y ago

Bash tip I picked up from observation - always start a potential command with #

# rm -rf some_dir

Then if you accidentally press return before completing it, nothing happens.

When you have reviewed and are sure it is correct, you recall and delete the hash to execute - simples!

arendtio5y ago

In my opinion, the option -r should only be allowed as the last parameter. Maybe with the exception of -f. Everything else is just f*ing dangerous.

I mean, I use the # hack sometimes too, but when I don't, I often find myself afraid of accidentally hitting the Enter key.

stjohnswarts5y ago

I generally throw up a status report type of thing "you are applying $this_operation to $this_many_machines on $this_farm. Continue (yes/no)?" and enforce yes/no full typing. Anything other than yes is a no
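A rough sketch of that kind of prompt in Python. The function name, its parameters and the injectable `ask` callable are illustrative assumptions, not anyone's actual tool.

```python
def confirm(operation, machine_count, farm, ask=input):
    """Show what is about to happen and accept only a fully typed 'yes'."""
    prompt = (f"You are applying {operation} to {machine_count} machines "
              f"on {farm}. Continue (yes/no)? ")
    # Anything other than the exact word "yes" counts as a no.
    return ask(prompt).strip() == "yes"
```

Passing `ask` as a parameter keeps the function testable without a terminal; an interactive caller would simply leave it at the default `input`.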

matart5y ago

Does this work with autocomplete?

greenyoda5y ago

Just tried it with bash on Linux, and apparently autocomplete works in a comment.

jrumbut5y ago

Even having a dry run mode is exciting. Doesn't even have to give complete results just "I was planning to delete 3 files and create 7 files", gives a hint whether the command will blow up the system or not.

dingaling5y ago

I wish SQL had a dry-run mode in updates and deletes for that reason.

"Run it as a query first" gets 90% of the way until you drop a constraint by accident whilst rewriting it as an update :o

harikb5y ago

For interactive queries / surgery, you do have an option with a transaction (begin/commit/abort).

If it is Postgres (don't know about other dbs), you can go a long way using "savepoints" and "rollbacks" to truly have a trial-and-error safe surgery on db. Still dangerous, but quite helpful. I hate working on any other db without those features. Postgres also allows schema changes to be within a txn envelope.

vlunkr5y ago

I've thought the same thing. I also wish SET came after WHERE. I've typed "UPDATE table_x SET something = true;" and then forgotten the WHERE clause.

krab5y ago

Transactions and rollback are the dry run. The problem is that if you keep the transaction open for too long, you will block other updates to the same data.

skymt5y ago

Enough folks have replied that transactions are the way to go, but I just wanted to add that whatever interface tool you use for your database may have an option to force you to commit your transactions manually. For example, PostgreSQL's default 'psql' shell has the AUTOCOMMIT variable which, when turned off, requires you to manually 'commit;' before any changes take effect.

SkyBelow5y ago

I think an improvement to SQL would be for insert/update/delete clauses to require a where clause and allow for something like 1=1 if you really intend to hit all rows. A safer but even more invasive change would be requiring an explicit end to the where clause as well (to prevent selecting a few but not all constraints).
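SQL itself has no such rule, but a wrapper can approximate it. The sketch below is a crude textual check in Python with a hypothetical `check_has_where` helper; it is not a SQL parser and can be fooled by the word WHERE inside a string literal or subquery.

```python
import re

def check_has_where(sql):
    """Reject UPDATE/DELETE statements that carry no WHERE clause.

    Returns the statement (without trailing semicolon) if it passes.
    Use "WHERE 1=1" to state explicitly that every row is intended.
    """
    statement = sql.strip().rstrip(";")
    is_write = re.match(r"\s*(update|delete)\b", statement, re.IGNORECASE)
    has_where = re.search(r"\bwhere\b", statement, re.IGNORECASE)
    if is_write and not has_where:
        raise ValueError("refusing to run UPDATE/DELETE without a WHERE clause")
    return statement
```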

verve_rat5y ago

Wrap it in a transaction and roll back the transaction at the end. Then remove the transaction when you are ready to do it for real.

You can jam a select in the end of the transaction to check what happens.
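As a small illustration of the transaction-as-dry-run approach, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. The `users` table and its rows are made up for the example; the same BEGIN / inspect / ROLLBACK shape applies to other databases that support transactions.

```python
import sqlite3

# isolation_level=None leaves transaction control to the explicit
# BEGIN / ROLLBACK statements below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, active INTEGER)")
conn.executemany("INSERT INTO users (id, active) VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 0)])

conn.execute("BEGIN")
cur = conn.execute("UPDATE users SET active = 0 WHERE id = 1")
rows_touched = cur.rowcount  # how many rows the statement changed
# Peek at the result from inside the transaction.
inside = conn.execute(
    "SELECT COUNT(*) FROM users WHERE active = 1").fetchone()[0]
conn.execute("ROLLBACK")  # swap for COMMIT once the numbers look right

after = conn.execute(
    "SELECT COUNT(*) FROM users WHERE active = 1").fetchone()[0]
print(rows_touched, inside, after)
```

If `rows_touched` is far larger than expected, the rollback has already undone the damage.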

cbm-vic-205y ago

MySQL has a command line option "--i-am-a-dummy" (aka "--safe-updates") for exactly this purpose.

https://dev.mysql.com/doc/refman/8.0/en/mysql-command-option...

austinl5y ago

I like this format in general, since it communicates the command is severe/irreversible. Heroku implements a similar confirmation when performing destructive actions. Commands require you to pass a `--confirm ${APP NAME}` flag, so the original command itself does nothing. Of course, this doesn't prevent you including those flags in makefiles, etc. I once dropped a table in a side project by accident because I took the wrong tab autocomplete suggestion in a makefile.

leetcrew5y ago

works great until some asshole puts

  alias harikb_script='harikb_script --do-it'

in their .bashrc to eliminate this annoying step.

actuallyalys5y ago

I suspect someone who'd do that isn't going to take that or other precautions seriously regardless of it being aliased. It's still a problem that they're circumventing it, but I think you have a larger problem if someone with that mindset has access to production.

xaedes5y ago

This would help a bit: Don't accept the "--do-it" as first parameter, make it obligatory to be the last.

X6S1x6Okd1st5y ago

If someone is a programmer and is trying to disable safety features making it slightly harder to do so doesn't really seem like the solution.

_ikke_5y ago

  my_command() {
      command my_command "$@" --do-it
  }

Xophmeister5y ago

We've been known to use something like --yes-i-really-mean-it-this-time for really dangerous options. It's like a built-in solemnisation step.

vehementi5y ago

I once came across one like this

$ run-script.sh --dry-run

`--dry-run` parameter not recognized

Executing ...

roydivision5y ago

Reminds me of the proposal to keep the nuclear launch codes inside the body of an innocent volunteer, so the President would have to kill the person to get the codes.

https://boingboing.net/2015/12/11/proposal-keep-the-nuclear-...

chrisseaton5y ago

I've never understood this idea.

If you believe we should never use nuclear weapons, then don't have them at all.

If you believe there is a case where it may be moral and rational to use nuclear weapons, why would you want to put a potential barrier in the way of their use? You could have a situation where everyone was agreed to use them but the president was physically unable to harm the aide to use them.

You can know that something is the right thing to do but not have the courage to physically harm someone to do it.

An interlock that you may not be able to unlock for reasons unrelated to the task at hand is a bad interlock.

shuntress5y ago

>You can know that something is the right thing to do but not have the courage to physically harm someone to do it.

In this specific case the "thing to do" is literally to harm hundreds of thousands of people.

The reasoning behind this proposed interlock is that any logic which concludes that it is moral and rational to harm hundreds of thousands of people must also conclude that it is moral and rational to harm the "interlock" individual. Otherwise, it is likely that dropping the bomb would be a mistake.

chrisseaton5y ago

> The reasoning behind this proposed interlock is that any logic which concludes that it is moral and rational to harm hundreds of thousands of people must also conclude that it is moral and rational to harm the "interlock" individual.

Yes, but you can know it's the right thing to do, but not be able to physically do it.

The president's ability to physically cut someone open is not relevant to whether it's a good idea to use nuclear weapons or not. Him being unable to do it tells you nothing about whether they should be launching the weapons.

If the president fails the test that tells you nothing about whether the launch is the right thing to do. Doesn't that fundamentally make the test bad?

motoboi5y ago

You put too much confidence in human reason.

Everybody agrees that this is a nuke-them-all situation, but the president, given that part of the task is personally ripping apart a human body, thinks more about the subject and decides another diplomatic round is a better option.

FactCore5y ago

I think that's the point. I'm personally not an advocate of this because it seems to be a little too "beat you over the head" with its moral metaphor, but the whole point is that the President should have to personally kill someone to understand the gravity of what they are about to do.

From the perspective of an advocate I'd say: If they can't come to terms with killing one, who are they to execute hundreds of thousands?

jodrellblank5y ago

> "If you believe there is a case where it may be moral and rational to use nuclear weapons, why would you want to put a potential barrier in the way of their use?"

Because you think the point where they become moral and rational to use is way way way further than commonly discussed, and you want to put many barriers of many kinds (physical, emotional, logistical) to delay their point of use without completely blocking them.

You could also say that if a person is incapable of doing the hard parts of the job, don't vote them into the position. (Downside of that is that you'll end up voting in someone who doesn't mind killing someone in cold blood while expecting that to be a filter that brings more empathy to the position).

greggman35y ago

> If you believe we should never use nuclear weapons, then don't have them at all.

Tell that to Russia. In the short amount of time only the USA had the bomb the USA bossed them all over with threats of using it.

gumby5y ago

> I've never understood this idea.

It's an attempt to make an abstraction concrete. Think of it as the trolley problem in real life.

Stalin is famously supposed to have said, "one death is a tragedy, 100,000 is a statistic". Cynical or not it is how humans think.

> If you believe we should never use nuclear weapons, then don't have them at all.

Strategic game theory and Mutual Assured Destruction depend on the possibility that the other guy will use them if you do, and may be the only way to prevent their use. Interestingly this is one reason why you want the other guy to know your procedures, capabilities, deployments etc. Secret weapons have no deterrent value.

chrisseaton5y ago

> Think of it as the trolley problem in real life.

Well exactly... doesn't that show you that it's a bad idea? People don't know if they could bring themselves to throw the switch even if everyone thinks it makes rational sense.

You're taking a rational, well-considered, strategic decision... and making the interlock a messy personal emotional one unrelated to the actual issue at hand. That sounds like the wrong way around to be doing things?

mikewarot5y ago

Trolley Problems are themselves a bad idea... the Kobayashi Maru is a similar exercise. I, like Kirk, don't believe that there are situations that can't be worked around if there is time to think, and resources to act.

networked5y ago

It's the 1980s, and the United States implements this policy. What happens on the Soviet side? After the United States' announcement the Soviet press and Soviet sympathizers worldwide gasp loudly in horror. "How cruel are Americans, really? Is the barbaric act of murdering and butchering an innocent young man the only thing still able to keep their president from destroying our Earth?"

The Soviet General Secretary soon receives a report about what the new policy means tactically. Americans will take several extra minutes, possibly more, to authorize retaliation. (The exact delay is subject to disagreement. Secret experiments are conducted to get the timing down. They are inconclusive.) Amid the decade's mounting tensions, a preemptive nuclear strike looks more tempting than before.

benlivengood5y ago

Too bad sociopaths and narcissists are more common in positions of power. All it would do is uselessly kill a volunteer.

Time is also of the essence for MAD; known delay only makes MAD less effective if e.g. sub-launched cruise missiles are faster than dissection. And do all the fallback commanders need their own willing victim to mount a response?

Aeolun5y ago

I dunno, Putin? Yes. Trump, shmaybe? Obama, not really.

I guess that’s why they consider the idea here and not there.

dgritsko5y ago

Similar idea as GitHub's "type the exact name of this repository if you want to delete it" confirmation dialog. Maybe that's really what you want to do, but in case that's not actually what you meant to do, having a few extra hoops to jump through seems like a good idea.

Hokusai5y ago

> having a few extra hoops to jump through seems like a good idea.

I think that there is more to that. You need to consciously type the name of the repo that you want to remove. Windows used to add a lot of jumps to get something done, and the result was mindless clicking the "yes" button and realizing 1 second later that you deleted important information.

Those extra hoops need to be cognitively meaningful.

Cthulhu_5y ago

Yes, and infrequent; the main issue with Windows (Vista mainly) was that it appeared far too often. Even with 7, when you're setting it up for the first time for example, I think it shows up too often.

Same with Terms & Conditions. If you want your customers to truly have read and understood them, you have to show them a short quiz at the end of it. You're required to do a quiz in Europe nowadays if you want to engage in stock trading.

segfaultbuserr5y ago

Some disk management software also has "type the exact label of this partition to reformat it" to prevent accidental data loss.

wjdp5y ago

Do you type the repo name, or just copy/paste or select/middle click it?

Half of me would want them to put `user-select: none` on that text. The other half has to archive 10+ repos and would hate that!

edanm5y ago

That's what I thought of immediately as well! I've seen that pattern in a few other places too, and I always think it's a really good UX choice.

luhn5y ago

One of the largest AWS outages to date was caused by a scenario like this. [1] A mistyped command removed too many servers from an S3 subsystem, overloading the remaining servers and crashing the subsystem. The failure snowballed until the entire S3 region was down, which then caused issues with dependent services like EBS, ALB, and Lambda. They couldn't even update the status page because that also depended on S3.

[1] https://aws.amazon.com/message/41926/

HenryKissinger5y ago

I remember that. The AWS dashboard was all green checkmarks... because the red icons the dashboard was supposed to display were stored on the crashed servers.

jodrellblank5y ago

>"overloading the remaining servers and crashing the subsystem. The failure snowballed until"

the entire Eastern Seaboard was without power?

https://youtu.be/XetplHcM7aQ?t=693 (James Burke's Connections, ref. cascading power cut 1965)

jasonpeacock5y ago

Raskin talks about the futility of this in his book The Humane Interface.

Basically, what happens is the brain switches operating context from "I want to do something" to "resolve this interruption (confirmation box)" and you don't relate the one to the other - you're so focused on getting rid of the interruption that the original task is forgotten until after the interruption is gone.

Then you switch back to the original task that had been interrupted by the confirmation box and then you realize you made a mistake.

It's much better to engineer "undo" ability into systems - like delaying commands (GMail's "Undo Send" does this), or caching previous state, etc.
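The "delay the command so it can be undone" idea can be sketched as a small holding queue. This is an illustrative Python sketch, not how GMail does it: the class name is made up, and the current time is passed in explicitly (an assumption made so the example is deterministic; a real tool would read a clock).

```python
class DelayedActions:
    """Hold actions for a grace period during which they can be cancelled."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.pending = {}   # action id -> (due time, callable)
        self.next_id = 0

    def submit(self, action, now):
        """Queue an action to run once the grace period has passed."""
        self.next_id += 1
        self.pending[self.next_id] = (now + self.delay, action)
        return self.next_id

    def undo(self, action_id):
        """Cancel a pending action; returns False if it is no longer pending."""
        return self.pending.pop(action_id, None) is not None

    def run_due(self, now):
        """Execute every action whose grace period has expired."""
        due = [i for i, (when, _) in self.pending.items() if when <= now]
        for i in due:
            _, action = self.pending.pop(i)
            action()
```

Undo only works inside the window; once `run_due` has executed an action, `undo` reports that it is too late.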

andrewflnr5y ago

That's exactly why it's not a "confirmation box", but requires you to slow down and think for half a second. She even talked about mitigating copy-paste, which is the next obvious way people could habituate.

Also, while undo is great, it's not always technically feasible. The tools in question are basically for modifying the layer that implements undo for your end users, and are often themselves fundamentally irreversible. Undo for raw hard disks involves forensic analysis at best.

jasonpeacock5y ago

The problem (I probably didn't paraphrase Raskin well) is when you slow down & think for a half a second, you context switch from "I need to do operation" to "I need to make this dialog box go away".

No matter what tasks are required to make the dialog box go away - doing math, retyping a message, clicking a randomly ordered box - that becomes the top task in your head and you "forget" about the original task until you finish this task.

Once you resolve the interruption, you switch context back to the original task and then you still have that "oh crap" moment.

Yes, sometimes undo is very difficult, and can require a system designed to support that ability as a first-class feature from the start. Many systems you can perform rollbacks, but there are definitely destructive actions - in which case you should have test stacks to validate your actions in advance, and peer review. (e.g. dual keys to launch the missiles)

robaato5y ago

Or you have commands which randomly reverse the meaning of the confirmation prompt:

Continue: yes or no?

Don't continue: yes or no?

As long as operators know to expect this, they also know to wait and actually read the prompt before answering (as in, turn off the automatic reaction)...
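One way such a prompt might look, as a hedged Python sketch. The function name and the injectable `ask` and `rng` parameters are assumptions for illustration; the two prompt texts are the ones quoted above.

```python
import random

def randomized_confirm(ask=input, rng=random):
    """Ask either 'Continue' or 'Don't continue', chosen at random.

    Returns True only when the answer means "go ahead" for the
    question that was actually shown.
    """
    inverted = rng.random() < 0.5
    question = ("Don't continue: yes or no? " if inverted
                else "Continue: yes or no? ")
    answer = ask(question).strip().lower()
    if answer not in ("yes", "no"):
        return False  # anything unclear is treated as "do not proceed"
    said_yes = answer == "yes"
    # "yes" means proceed on the normal prompt, stop on the inverted one.
    return said_yes != inverted
```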

bronco210165y ago

It amazes me that something like this can be done by a single person.

In aviation any time input is given to the machine, it's entered by one human (typically pilot flying) and then verified by the other human (typically pilot monitoring) before being committed to or executed. For example... when a new altitude is assigned by ATC, say FL300, the pilot flying will spin it in the selector window and keep his hand or finger there until the second pilot agrees with and confirms the selection by reading FL300 out of the selector window.

I know there are meat bags in these giant tubes so that changes attitudes towards safety etc. However, it seems to me that when organizations start putting the power to halt nearly the entire business in the hands of one person, there should be some slightly different attitudes. A breaking change in a million servers could easily cost hundreds of thousands or maybe even millions in lost revenue or employee productivity.

I'm just an outsider though. Perhaps this level of attention is practiced at some shops. It's just interesting to me how in some fields we settle on pretty uniform standard practices whereas others are seen as non-human-life threatening so it's just shoot first, ask questions later.

rachelbythebay5y ago

Best practice for using the "weaponized" version of the tool when you had powers to actually hit all of them at once was to paste the command into IRC and get some of your fellow peeps to eyeball it and make sure it was sane.

<me> team: hey, sanity check this please: hsh -A "dumb_thing && other_thing --foo --bar"

<teammate> shipit

[ I type the command ]

<me> ok, running as job 1234

The last part was a courtesy done so that they could watch the progress of it too without having to dig to find my request. It also meant they could kill it easily if something went wrong and they couldn't raise me for some reason.

Tools like this are best used outside the solo realm.

im3w1l5y ago

I think an automated tool would be preferable since there is no 100% foolproof guarantee that what you type in irc is the same as what you type in the terminal.

crispyambulance5y ago

> It amazes me that something like this can be done by a single person.

In many dysfunctional orgs, having someone to blame is desirable. They will use all kinds of words for it like "accountability".

But at the end of the day, heroes who take stupid risks that succeed get rewarded, cautious people that ask questions and try to understand before acting are smugly dismissed, and would-be heroes that burn the house down because of recklessness get blamed and make everyone else look good. It's all too common.

cle5y ago

In shops where stakes are high, it’s not uncommon to do just like you said—have mechanisms that force someone else to verify what you’re about to do, before you do it. If someone else can’t verify, the tool will block you. It’s similar in spirit to requiring code reviews on all shipped code.

illumin85y ago

This is a great idea, and I'd like to point out that having such a system in place would have prevented one of the largest Internet outages in recent memory - the Amazon S3 outage in 2017: https://aws.amazon.com/message/41926/

> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

zedpm5y ago

It's kind of funny, since various operations performed in the AWS web console use this model (e.g. type the name of the resource you're trying to delete). As an organization, they're aware of this approach and think it's useful, but (presumably) didn't use it in their own internal tooling.

audience_mem5y ago

Perhaps those were added after they learnt their lesson.

educationcto5y ago

Terraform prints out the number of resources changed and at least requires a "yes" to proceed. Not quite as onerous as described but at least prevents some types of fat-fingering. Basically all changes with Terraform are risky as they usually involve bringing up and down infrastructure.

    Terraform will perform the following actions:

      # google_compute_instance.vm_instance will be created
      + resource "google_compute_instance" "vm_instance" {
          + ... <more>

    Plan: 2 to add, 0 to change, 0 to destroy.

    Do you want to perform these actions?
      Terraform will perform the actions described above.
      Only 'yes' will be accepted to approve.

      Enter a value: yes

caymanjim5y ago

This is exactly the problem the author is referring to. With Terraform, you always type "yes" to proceed, so it turns into muscle memory. You stop reading the output, and you're already typing "yes" before you even see the prompt. Terraform's output is also verbose, and many changes show up as "1 to add, 0 to change, 1 to destroy" because they don't separately list a "replace" category. It's pretty bad; you've got cognitive overload, confusing output summary, and a predetermined continue answer. And this is often an action you're performing under duress. I've been bitten by it plenty of times.

brodouevencode5y ago

IaC is a real time saver, but inherently dangerous.

remram5y ago

A similar system is molly-guard [1], which replaces the reboot/halt/poweroff/... commands with scripts that make you type in the name of the machine before proceeding. Avoids shutting down the wrong machine because you forgot where you SSH'd.

[1]: https://manpages.debian.org/buster/molly-guard/molly-guard.8...

b6z5y ago

Many years ago, I made that mistake two or three times, rebooting the wrong machine. Since then, I use molly-guard on all my remote machines. Never happened again.

Darkphibre5y ago

Reminds me of when the Fortune 50 company (150k employees) I worked for rolled out new firewall restrictions that blocked the DNS port.

To all machines. Employee and servers alike.

Yes. Including the DNS servers.

Took them a day or two to work out how to roll that one back.

zamadatix5y ago

The first use of a new security product my manager insisted we roll out (as a duplicate to an existing tool from another group) was to quarantine a change in a system file that seemed to be spreading through all of the PCs.

Except the change was to quarantine explorer.exe which was being changed with a patch that just got pushed out. The net result was about 6 hours of the desktop group wondering "why the hell are all of the PCs not logging in right after this patch" followed by about a month of rolling tickets from seldom used computers that had just been powered off.

His excuse was it only showed a file hash in the main screen and you had to view details to see the name plus he had a 3 day change open to roll out the system. Never understood how he got away with that one but such things did catch up to him about 2 years later.

tialaramex5y ago

So, related obviously correct designs:

1. Git's Force-with-lease. Git push's "force" is too powerful, you will likely regret this much power, but it's tempting. So force-with-lease is the same power but conditional on you telling git what exactly the state was that you're overriding.

This has two benefits, one is like Rachel's, it is an opportunity for a human to stop for a moment and consider, wait, why are we overriding this state? To find out what it is we might as well read... oh the state says it's an "emergency fix. Call Jerry". Maybe, just maybe, I ought to call Jerry before I force overwrite it?

But the other is about race conditions which Rachel doesn't specifically address. If you are very careful to check that the state you want to overwrite with force is indeed a state that should be overridden, nothing prevents it meanwhile changing and then you overwrote state you didn't even know existed. But force-with-lease fixes that because your lease won't match.

I believe Force-with-lease is a pattern that ought to be far more widespread. I've used several configuration management tools that let somebody say "Temporarily don't mess with config on these machines" and some of them let you write a reason like "James is rebuilding the RAID arrays" but none of them have that force-with-lease pattern that would let me say "I know James is rebuilding the RAID arrays, this change must happen anyway but if anything else is blocking the change then reject it and let me know".

2. Prefer Undo to Confirmation. If the computer can undo the action, even if that's a bunch of work and you'd rather not bother, put that work in and enable undo. Humans always know they "really" wanted to do the thing you're asking them to confirm so it's somewhat futile to ask, but they often realise they didn't want to afterwards and will undo it if you make that possible.

Not everything can be undone. Undo factory reset isn't a thing. But for lots of things you can't undo, the reason was just laziness; try to do better in your own software. Your users (which might include you) will be grateful.
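The force-with-lease pattern from point 1 is essentially compare-and-swap: the override succeeds only if the state being overridden is the state the caller says they saw. A minimal Python sketch for the configuration-hold example, with hypothetical class and method names and no locking (a real tool would need the check and the change to be atomic):

```python
class LeaseMismatch(Exception):
    """Raised when the hold is not the one the caller expected."""

class ConfigHold:
    """A 'do not touch' note on a machine, overridable only with a matching lease."""

    def __init__(self):
        self.reason = None  # None means nobody has placed a hold

    def place(self, reason):
        self.reason = reason

    def apply_change(self, change, expected_reason=None):
        """Run change() only if the current hold is the one the caller saw."""
        if self.reason != expected_reason:
            raise LeaseMismatch(
                f"hold is now {self.reason!r}, expected {expected_reason!r}")
        return change()
```

If someone replaces the hold with "emergency fix. Call Jerry" between the check and the override, the change is rejected instead of silently trampling it.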

coder5435y ago

Related but semi-random: it slightly annoys me that force-with-lease goes through the entire effort of force pushing if it thinks the remote is identical to the local. It’s not going to change anything either way, and it could save me the second or two of waiting on it to do nothing. If local is already identical to the last known state of the remote, and I’m trying to force push, the actual error is that I didn’t edit the local branch in the way I thought I had when I decided it was time to force push.

(I realize there is a possible error message case if the remote has changed... but I don’t feel like this command is the best one to use to discover whether the remote has changed, if you have no changes you actually intend to force push.)

vondur5y ago

That may have helped when Emory University's IT dept. accidentally sent a wipe and reformat command using Microsoft's SCCM to all of the Windows computers and servers on campus back in 2014. https://it.slashdot.org/story/14/05/17/051214/emory-universi...

kbenson5y ago

This is a topic near and dear to my heart, as I'm often that person arguing to make something slightly less automated because the small trade-off in time is insurance against some of the worst mistakes you can have. Automation to the point of removing humans leads to stupid problems that a human wouldn't make if they looked at what was going on. So we automate to the point where we minimize human contact, presenting a summary of actions that as humans we can apply our wonderful brains to and prevent those problems. Except some percentage of the time we don't actually pay attention, and depending on how the human interaction was introduced instead of complete automation, some percentage (or multiple!) of errors still sneak through.

Automation to the point of minimal human contact where you assume the human will read the presented information and make an informed decision doesn't work. The point is that we want a human to understand what is being asked, so taking some step to ensure they do understand is warranted. It will never be perfect, but adding steps like she proposes are definitely a step in the right direction, IMO.

rossjudson5y ago

This resonates with me. Years ago I took down a service in a cell accidentally (Googlers might empathize: never 'borg' when you meant to 'borgcfg'). If I had been asked to enter the exact number of tasks I was about to nuke, I might have thought twice ;)
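A sketch of what "enter the exact number of tasks" could look like, in Python. The function name, the prompt wording and the injectable `ask` callable are illustrative assumptions rather than any real tool's interface.

```python
def confirm_count(targets, ask=input):
    """Proceed only if the operator types the exact number of targets."""
    expected = len(targets)
    prompt = (f"This will affect {expected} tasks. "
              "Type the number of tasks to proceed: ")
    # A habitual "yes" or "y" never matches, so it cannot be muscle memory.
    return ask(prompt).strip() == str(expected)
```

Because the required answer changes with the size of the blast radius, an operator who expected 3 tasks and sees 3000 has to consciously type the larger number.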

scottlamb5y ago

I've certainly deliberately downed an enormous number of tasks, though, as part of a cluster turn-down. I love the technique of requiring the operator to echo a key fact, but in the case you're describing I think the key fact is not how many tasks but that they're serving live traffic. So:

* You could ask the operator to echo the qps figure...but really any number other than zero is likely to be an error, so it can just error out in that case without needing the confirmation.

* Even if it is serving zero qps now, if it's not explicitly drained at the load balancer, downing it is likely to be a mistake. So even better to check that.
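The two checks in the list above could be expressed as a precondition function like the following Python sketch. The `qps` and `drained` inputs are hypothetical; how they would be obtained from monitoring or the load balancer is not shown.

```python
def check_safe_to_down(qps, drained):
    """Return the reasons not to take the job down; empty means go ahead."""
    problems = []
    if qps > 0:
        problems.append(f"job is serving {qps} qps")
    if not drained:
        problems.append("job is not drained at the load balancer")
    return problems
```

A tool built on this would error out when the list is non-empty rather than prompt, keeping any override flag for the rare deliberate case.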

Only once in my career have I taken down jobs serving live traffic. (They were serving 100% errors.) It was deliberate, but even so I wouldn't have minded having to supply a --yes-i-know-im-downing-live-jobs.

edit: and if for some reason my assumption is wrong and downing undrained things becomes routine...well, you'd want to fix that, but as a short term measure going back to confirming a number rather than the force option would be appropriate. It's certainly not good to have an override that's routinely used.
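
A minimal sketch of the two checks in the list above, in Python. The `JobStatus` fields and the `force_live` parameter are hypothetical stand-ins (the latter for a flag such as --yes-i-know-im-downing-live-jobs); nothing here reflects a real Borg API.

```python
from dataclasses import dataclass


@dataclass
class JobStatus:
    """Hypothetical snapshot of a job; the field names are illustrative."""
    name: str
    tasks: int
    qps: float
    drained: bool  # True if the load balancer reports the job as drained


def check_turndown(job: JobStatus, force_live: bool = False) -> None:
    """Raise RuntimeError unless the job looks safe to take down."""
    if force_live:
        # Explicit override: the operator has said they know it is live.
        return
    if job.qps > 0:
        # Any non-zero traffic is treated as an error, with no prompt.
        raise RuntimeError(
            f"{job.name} is serving {job.qps:g} qps; "
            f"refusing to down {job.tasks} tasks"
        )
    if not job.drained:
        # Zero traffic right now is not enough; require an explicit drain.
        raise RuntimeError(
            f"{job.name} is idle but not drained at the load balancer; refusing"
        )
```

The override is kept as a separate, loudly named parameter so that routine turn-downs never need it.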

jeffbee5y ago

The way we approached this on my SRE team was semi-manual with improved ergonomics. We embedded the live traffic graph in the turndown tool, so it would be right in your face before you took the destructive action. Of course it was always possible to go one level down on the tooling and do everything manually, but it wasn't the usual way.

scottlamb5y ago

Seems reasonable, but as you might have seen, rossjudson did accidentally-ish go to a lower layer: he wrote "never 'borg' when you meant to 'borgcfg'". And you're still relying on someone actually looking at the graph in their face which isn't as sure a thing as it'd be if they had to echo something back as Rachel is advocating for.

(For the benefit of non-Googlers/Xooglers: borg is a lower-level tool mostly used when everything else has gone wrong and borgcfg is a higher-level, more routine tool. These days people often layer things on top of that as well, because we love piling up abstraction layers. This approach is completely successful because abstraction layers never leak and solve every problem without making anything hard to debug at all. /s)

In my ideal world, even the lowest layer a human ever uses would do safety checks by default. Eg, imagine if the job specification included "query this safety check service on change" and the borg tool (as part of querying the existing job on a cancel/rm command) discovered that and honored it. Most people/jobs would use a safety check that fails taking down a job unless the load balancer reports all relevant services have that job drained. The safety check service could also specify a confirmation prompt (similar to what Rachel is advocating) that could be customizable (like qps or percent of global capacity rather than just number of tasks). The safety check would be effective no matter what layer you use, and there'd be no good reason to use one that would cause prompt fatigue. The outage rossjudson described (and I know he's not the only one who has done exactly this!) would have been avoided.

gabeio5y ago

I do like this idea, this is I assume why github makes you type the repo name out in full. I wish AWS followed suit, when deleting any RDS (database) instance on AWS all you have to type is "delete me"... very easy to copy and paste as well as just know what you need to type and be on autopilot. I have even poked support about it and their response was underwhelming.

jaclaz5y ago

Side question.

How many/which companies have more than one million Linux machines?

notacoward5y ago

At least Facebook (where OP worked), Amazon, Google, and Microsoft. Probably Netflix, maybe Apple. There might be a couple more, but no more than that because we've already accounted for a pretty high percentage of worldwide shipments for servers, disks, etc. Fun fact: when you're that big, your demand creates its own inflation and you have to consider that in projections.

kube-system5y ago

If by "machine" we also mean things outside of a 19" rack, I would wager that large telecoms probably have way more devices running Linux than FAANG. Imagine the network of cable modems that Comcast alone must operate. What percentage of their 28+ million broadband customers rent Comcast owned/managed modems? Almost all of them except the tech-savvy crowd? And that's just one device type.

InitialLastName5y ago

Not to mention the networks of cellular base stations worldwide that run extremely sophisticated systems (if not Linux itself).

jaclaz5y ago

Thanks, so a handful at most, and the "usual" ones, I always thought that those companies keep their machines connected in (redundant) "sets" and that a command affecting all of them was more a case for "never" rather than "once in a while".

jeffbee5y ago

Google, at least, has a thing that is supposed to prevent widespread disruption at the machine level, called the "Safe Removal Service"[1]. This is a good idea that in practice isn't perfect. If you write a tool that does not consult SRS, or your service doesn't declare a SRS policy, there can be surprises.

A particular outage that I will never forget took out Gmail delivery worldwide in an instant, because the change was not expected to be disruptive and therefore did not integrate with SRS. As it turned out the change disabled the machines where it was applied, and the process of selecting a subset of machines to canary the change was not independent of the way in which Gmail assigns services to machines, so in the space of a few seconds they created a global outage.

[1] https://twitter.com/bgrant0607/status/1134536670504554496

abnry5y ago

The number blew me away. But does she mean in one location or VMs?

One million is a lot no matter how you slice it.

rachelbythebay5y ago

How do you define one location? If it's like, a contiguous plat of land with a bunch of buildings, each containing suites, and each of those containing clusters... then these days, yeah, that's probably not too much of a stretch.

And yeah, physical machines, not VMs. Sometimes they're blades, sometimes they're sleds, but I mean real hardware made out of metal that you can pick up and use to defend the datacenter if you have to.

(Although, honestly, I was talking about global counts in the million+ range when I wrote it since it was referencing the past, but by now, a region with a million+ is not far-fetched.)

Ayesh5y ago

I have an old laptop with a dead battery, and for a BIOS upgrade, it prevents me from updating without 50% battery.

I have to type "danger" to bypass this restriction, and I thought it was pretty cool.

Another good UI pattern is in Firefox, that it disables the Run button on downloads for a few seconds.

duskwuff5y ago

Disabling the "run" button for a few seconds was actually done to mitigate another risk -- sites cueing the user to click in a particular location, then triggering the confirmation dialog with the "run" button right where the user was about to click.

ineedasername5y ago

Oh god this would have saved me so much stress once. It was early in my career, and part of my duties was to run a merge/purge process on dupe records.

I'd select the dupes for merge using a checkbox, but the vendor's interface for this just had a "confirm" button. So, I confirmed. However I'd selected the "select all" box and.... confirmed. Merging every. single. record. into one (1) record.

I was fortunate, the vendor was able to roll back the changes, and nothing was lost. I also had a very good mentor-like boss who avoided reaming me out before we knew if there was a solution or not, and when there was he simply told me "I'm sure you've learned your lesson, but don't do that again."

aqme285y ago

Nitpicking

> "This might be as simple as printing the number with your locale's version of numerical separators, like "123,456" or "123.456" or "123 456" or whatever else you might use where you are. The trick is then to NOT accept that as input, but instead demand that they remove the separator and jam it in as just digits. "

It's easier to just strip non-digit characters than to parse the input for them and respond accordingly. This is a confirmation step with basically a checksum, so you're not going to get many false positives.

Kerrick5y ago

Stripping the non-digit characters would allow "123,456" to validate instead of only accepting "123456" -- which defeats the whole purpose of printing the number with numerical separators (to prevent copy/paste).
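
A rough Python sketch of the behavior being argued for here: show the count with separators, accept only the bare digits. The function names are made up for illustration, and the comma separator is an assumption (a real tool might use the locale's own).

```python
def format_count(n: int) -> str:
    """Display form with thousands separators, e.g. 123456 -> '123,456'."""
    return f"{n:,}"


def count_confirmed(expected: int, typed: str) -> bool:
    """Accept only the bare digits; the displayed form itself must fail."""
    return typed.strip() == str(expected)


def prompt_for_count(expected: int) -> bool:
    """Interactive wrapper: print the formatted count, read the answer."""
    print(f"{format_count(expected)} machines will be affected.")
    return count_confirmed(expected, input("Type the number, digits only: "))
```

With an expected count of 123456, pasting "123,456" is rejected and typing "123456" is accepted, which is the difference that stripping non-digit characters would erase.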

aqme285y ago

If you're worried about copy-paste, make it a random code.

nemo16185y ago

Notably, Discord does something like this when you @everyone in a large channel: "You're about to push a notification to 12,000 people, are you sure you want to do that...?"

pwinnski5y ago

Sounds like a yes/no answer is expected? If so, that is exactly what Rachel is suggesting is not enough.

jerf5y ago

In this case, usually the very fact that a popup unexpectedly popped up is enough. I use Konsole as my main shell, and like several other shells now it has a "You're about to paste 100KB, yes/no?", and I don't mindlessly click "yes" because it is already a "cache miss" to see that dialog at all.

raverbashing5y ago

Slack should take a note of this. Especially for rogue @here notifications

tigger0jk5y ago

I've typically used pdsh https://github.com/chaos/pdsh for these types of commands, and I don't think they have any such safety options. The only protection is to be wracked with fear whenever you type pdsh. Obviously this fear wanes with use, and eventually you don't think about a command for long enough before you do it and hit enter on a regrettable one.

cle5y ago

Even better than you confirming your own action, is someone else confirming it. If the stakes are high, require two people to turn the keys, instead of just one.

rcarmo5y ago

This reminded me that a few years back I worked at a place where (notoriously) Puppet would occasionally go over some random box and remove access to people, just because.

Or to all the machines, on one occasion.

(It was actually some sort of race condition when we massively updated per-project access permissions and asked for SSH keys to be redeployed, but it was annoying as heck, and sure to happen whenever you really needed to access that particular machine.)

lqet5y ago

GitHub has been doing this for quite a while now when you try to delete a repository - you have to type in the exact repository name to confirm.

bmaupin5y ago

Which I always mindlessly copy and paste...

jraph5y ago

But maybe this is enough? I do this too, but this gives me time to actually read the repo name twice. It's way better than a confirm button for me.

I'm sure it would also wake me up from autopilot. But I don't do this often so I can't really know. It seems like this is good enough for many people, who don't perform this action too often.

coder5435y ago

If you really think that’s an issue, pasting could be disabled for that input field. Would that make you happier?

It hasn’t been an issue for me, since repo names aren’t usually super long and onerous to type.

gruez5y ago

> If you really think that’s an issue, pasting could be disabled for that input field. Would that make you happier?

Many (most?) HN users probably have that disabled, because too many sites abuse it to block password managers for "security reasons".

temporallobe5y ago

This is similar to a UI solution a colleague and I came up with. The action the user could kick off was unstoppable and irreversible (a large batch job), and it seemed like even a confirmation prompt was too easy to simply click through. So we had the UI present a modal dialog asking the user to type in a specific word in all caps to confirm the action. Worked like a charm.

D-Coder5y ago

I did a similar thing with a Star Trek program many years ago. One of the commands (22? 23?) was to detonate the warp engines in the hope of taking the enemy with you.

After hitting the wrong number once, I added a confirmation that presented a random six-digit number that you had to enter before it accepted the command.
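
Something along these lines, as a small Python sketch (not the original program; the function names are invented for illustration):

```python
import secrets


def new_code() -> str:
    """Return a random six-digit code, zero-padded (e.g. '004217')."""
    return f"{secrets.randbelow(10**6):06d}"


def code_confirmed(code: str, typed: str) -> bool:
    """True only if the operator typed the code exactly."""
    return typed.strip() == code


def confirm_dangerous_command() -> bool:
    """Show a fresh code and require it to be typed back."""
    code = new_code()
    return code_confirmed(code, input(f"Enter {code} to confirm: "))
```

Because the code changes on every attempt, muscle memory or a stray keypress can't get through.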

TravHatesMe5y ago

Reminds me of a study where a test was given with questions that weren't difficult but were likely to invite a silly error. Around 85% of participants got at least one question wrong, but when they repeated the same test with a difficult-to-read font, that number dropped to ~25% or so. That's another way to make your brain work: use a terrible font.

apricot5y ago

> That's another way to make your brain work, use a terrible font.

And suddenly my complex analysis prof who wrote his exams in Comic Sans is vindicated!

willvarfar5y ago

I am so adding this to a query API I have, where it's all too easy to leave off constraints and end up asking for massive data sets by mistake.

Thinking I can probably enhance it by forcing the user to type in the number as text rather than numeric, so they can't cut-n-paste. Kind of force them to type in "I am sure I want all data ever" or something.

recursive5y ago

I don't think this is useful for an api. This is only useful when humans are the direct user of the component. Automated users, like those of an API will dutifully provide the required safety value.

mcintyre19945y ago

AWS sometimes does something similar to this like “enter the name of the thing you’re trying to delete to confirm”. I think it makes sense because you can have such a huge difference between how much you care about certain s3 buckets or CloudFormation deploys etc. In true AWS fashion it’s inconsistent between services though.

nucleardog5y ago

To their credit, even if it’s unintentional, every time one of those screens pop up I have to stop and think about what I’m doing because every screen wants something different from me!

heelix5y ago

Back in the Spiderman 2 days, I worked for a content management company that was supporting a really, really big website. I believe they were playing host file games for Stage/Prod. Was in the room when they demo'ed something, did a restart of the system - and every pager in the room went off. Yah...

Cthulhu_5y ago

I for one can't fathom any organization managing a million devices / servers / VMs / whatnot. I'm having enough trouble with one, and my biggest employers had maybe a few dozen at best, and they already had a dedicated ops team that worked mainly with infrastructure-as-code.

woliveirajr5y ago

Once I had to deal with some software-RAID in Linux (mdadm it is), around 2007. There was some -force option that would just print information explaining what it would do and, to perform the real action, you needed to type another flag (that should never be revealed).

Edit: added name of software

andrewfromx5y ago

I've done this before by displaying the Unix epoch and asking the user to copy/paste that value WITHIN a 3 second window as an env var. i.e. if you up arrow and run the same TIMESTAMP=1603827448 ./foo it won't work because 1603827448 is now way too old.
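
A sketch of that freshness check in Python, assuming the value arrives in an environment variable named TIMESTAMP as in the example; the function names and messages are illustrative only.

```python
import os
import time

MAX_AGE_SECONDS = 3  # the window mentioned above


def timestamp_is_fresh(value, now=None, max_age=MAX_AGE_SECONDS) -> bool:
    """True if value is a Unix timestamp no more than max_age seconds old."""
    if now is None:
        now = time.time()
    try:
        stamp = int(value)
    except (TypeError, ValueError):
        # Missing or non-numeric value: treat as unconfirmed.
        return False
    # Reject stale values and also values from the future.
    return 0 <= now - stamp <= max_age


def main() -> int:
    """Return 0 to proceed, 1 after printing the value to re-run with."""
    if not timestamp_is_fresh(os.environ.get("TIMESTAMP")):
        print(f"Re-run with TIMESTAMP={int(time.time())} "
              f"within {MAX_AGE_SECONDS} seconds.")
        return 1
    print("Confirmed; proceeding.")
    return 0
```

A script would call `main()` at startup and exit with its return value, so an up-arrow re-run with an old value fails and prints a fresh one.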

myroon55y ago

One of the main benefits is explicitly acknowledging relevant context. Timestamps don't provide additional relevant context

sidpatil5y ago

Hmm, it's conceptually like a combination of a CAPTCHA and a launch code.

vsnf5y ago

I do this with a git pre-push hook to the main branch of my repositories. It displays a prompt in red and forces me to type in the name of the branch.

The result of one too many mindlessly accidental pushes.
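
Roughly like this, sketched in Python rather than shell. Git passes a pre-push hook one line per ref on stdin, so the answer has to be read from the terminal instead; the protected branch name and the helper names are assumptions for the example.

```python
import sys

PREFIX = "refs/heads/"
PROTECTED = {PREFIX + "main"}  # assumed branch name; adjust per repository


def protected_targets(lines):
    """Return the protected remote refs named in pre-push stdin lines.

    Each line has the form:
    <local ref> <local sha> <remote ref> <remote sha>
    """
    found = []
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[2] in PROTECTED:
            found.append(parts[2])
    return found


def main() -> int:
    """Return 0 to allow the push, 1 to abort it."""
    for ref in protected_targets(sys.stdin):
        branch = ref[len(PREFIX):]
        # stdin carries the ref data, so ask on the controlling terminal.
        with open("/dev/tty") as tty:
            print(f"\033[31mPushing to {branch}. "
                  f"Type the branch name to continue:\033[0m ",
                  end="", flush=True)
            if tty.readline().strip() != branch:
                print("Push aborted.", file=sys.stderr)
                return 1
    return 0
```

To use it as a hook, the file would go in .git/hooks/pre-push, be made executable with a Python shebang, and end with `sys.exit(main())`; a non-zero exit makes git abandon the push.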

regularfry5y ago

I've seen this implemented as "Please type: My username is $USERNAME and I will not cry over spilt milk" but that was more to guard against support tickets.

diebeforei4855y ago

I'm thinking this could also be useful for cases where colleges mistakenly email all applicants saying they'd been accepted, when they in fact had not been.

gitgud5y ago

> "I've worked at a few places that had a large number of Linux boxes. I'm talking about well over a million."

A few places!? What is an example of this?

throwawaygh5y ago

My guess: Rackspace, Google, and Facebook.

ComodoHacker5y ago

In role-playing games, it's a common practice to confirm deletion of your character by typing in some word, like 'delete' or character name.

bnastic5y ago

Promise Pegasus (thunderbolt storage) comes with a GUI that does the same thing - to shut it down you have to type “CONFIRM” before clicking the button

Animats5y ago

Yes. Github does that when you delete a repository. You have to confirm by typing in the name of the repository you are deleting.

larrik5y ago

I've seen this sort of thing in a few places, and I really do think it's a great idea.

RobRivera5y ago

Having babysat my fair share of critical clusters, i support this advice

wotton5y ago

Marketo, the marketing automation platform, does this when you try to do things to large data sets, very useful.

konjin5y ago

Finally the Roman numeral converter I programmed in university will be useful.

eznzt5y ago

Debian already does this, it asks you to type something like "yes do as I asked" if you want to remove a package that is considered to be part of the core.

jerf5y ago

https://news.ycombinator.com/item?id=24907002

Looks like https vs http link.

dang5y ago

We've merged the threads now. Thanks!

jancsika5y ago

It would be neat to print out an esoteric error that gets a single result in Google, where the "forum" in the result has a rando answer about using a certain esoteric flag.

Then you search the logs to see who is trying the command with the esoteric flag and "fix the glitch with payroll" for those employees.

JoeAltmaier5y ago

Makes it harder to nest that command inside a script - you have to parse out the number and paste it back? Or do I misunderstand - should it still prompt the user in the middle of the process when that step arrives? That would be problematical if it were included in a web page or whatever.

ccakes5y ago

The very point of this is to make it difficult to do what you’re describing.

If the tool could potentially touch a large number of machines, even if you’re super sure you got it right you should still prompt the user.

JoeAltmaier5y ago

Or write a script that carefully calculates the number of machines and gets it right. I guess you wouldn't use this prompting script then?

larrik5y ago

I believe this would be as part of the script you are writing, not the scripts you are calling.

rad_gruchalski5y ago

Hopefully there’s an API to fetch that count :)

outworlder5y ago

> 1221425541 machines will be affected

"Do you care? (Y/N)"

Cattle, people. Not pets. Just make sure you don't hit all machines simultaneously and are rolling, instead.

Since the post is talking about automation anyway, assume that any machine that can go down will go down. Ensure that any such disruption will be minimal. Oops, you just killed the production database? Whatever, who cares, it has just failed over anyway (or, for a distributed one, a new node was elected, data started replicating, etc).

If one considers having to SSH to a machine to be an anti-pattern, it's amazing how much crap goes away.

In the more generalized case, where it's not about machines, then it makes more sense. Maybe you are running a query that's going to perform updates across multiple clusters. It still should not be done by hand with direct production access - unless you are in the middle of a declared (and urgent!) incident and everything is on fire. In which case there's a bunch of people watching over your shoulder (or more likely, screen sharing in a conference call).

The same job you have (hopefully) run in QA you should be able to re-target to production. Make the question just be a way to "unlock" your automation - for instance, by not copying credentials or environment information until the proper confirmation has been received. One should still have an escape hatch for when (not IF) things go wrong.

joshuamorton5y ago

Killing all of your cattle is still a concern.

csmattryder5y ago