The trick with this perspective is that, after identifying the real risks, you can then link the risks and possible mitigations by looking at all 'things' and identifying the ways in which they might fail (and how this may be prevented). This way you can easily identify which mitigations help prevent risks and which risks are not sufficiently mitigated. It's a fair bit of work, but it's not complicated and often gives useful insights.
What this article basically does is note that you should first assess what risks a failed deployment carries, and it correctly states that in quite a few cases this risk is low, so the mitigations (of which there can be many) may not be necessary and may in fact be doing harm without actually sufficiently preventing any risk.
Say the company you work for is worth $10,000,000, and that you're hosted on GCP. Now take your best guess: what do you think the likelihood is of e.g. a fire or earthquake or something occurring in all relevant Google infrastructure simultaneously*, basically ushering in the end of all of your infrastructure, data, and backups? Frame that in a number of years. Is this kind of event something that may happen once in a thousand years? Once in ten thousand years? Let's say this is the sort of thing that might happen once in ten thousand years -- that's a long time!
Then the cost of this particular risk to your company is $1000 / year.
This kind of math isn't just a toy. When you have questions like "would maintaining actual physical backups in a safe somewhere outside of GCP be worth it?", you now have a framework to answer them ("if it would cost less than $1000 per year, then yes").
--
* or substitute in your favorite company-ending event.
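The annualized-risk arithmetic in the example above can be sketched in a few lines. The figures (company value, event frequency) are the hypothetical ones from the comment, not real data:

```python
# Sketch of the annualized-risk math: spread the damage of a rare
# event over its expected recurrence interval. Numbers below are the
# hypothetical ones from the example, not real actuarial data.

def annual_risk_cost(damage_usd: float, once_every_n_years: float) -> float:
    """Expected yearly cost of an event causing `damage_usd`
    roughly once every `once_every_n_years` years."""
    return damage_usd / once_every_n_years

# Total loss of a $10,000,000 company, once in ten thousand years:
cost = annual_risk_cost(10_000_000, 10_000)
print(cost)  # 1000.0 -- so a mitigation is worth at most ~$1000/year
```

The same function then answers the "physical backups outside GCP" question: if the mitigation costs less per year than `annual_risk_cost` of the event it prevents, it pays for itself.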
This avoids two nasty problems with trying to express risk as an expected value.
The first is that it is hard to express all kinds of probabilities and damages numerically: not all damages convert easily to money, and some probabilities are hard to guess (you quickly get uncertain probabilities, but expected values just flatten those into an average again). Even without those issues, pinning a number on it can lead to lots of discussion (good if you want discussion, not so good if you want to get shit done).
The second is that you easily fall into the trap of assuming everything has an average and that the law of large numbers applies. While physics kind of helps you there by putting hard limits on the maximum amount of damage possible, you may end up in a situation where all the nasty stuff is in the long, improbable tail. A good example is earthquakes: ground motion increases tenfold for every point on the Richter scale, but frequency also only decreases tenfold -- what, then, is the average?
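The earthquake point can be made concrete with toy numbers: if each extra magnitude point multiplies damage by 10 but divides frequency by 10, every magnitude bin contributes the same expected damage, so the "average" grows without bound as you include rarer, bigger events. (The numbers below are illustrative, not real seismology.)

```python
# Toy model of a heavy tail: tenfold rarer, tenfold worse per bin.
# Each bin contributes the same expected damage, so the sum never
# converges -- it just grows with however many bins you include.

def expected_damage(n_magnitude_bins: int,
                    base_freq_per_year: float = 1.0,
                    base_damage: float = 1.0) -> float:
    total = 0.0
    for k in range(n_magnitude_bins):
        freq = base_freq_per_year / 10**k   # tenfold rarer per bin
        damage = base_damage * 10**k        # tenfold worse per bin
        total += freq * damage              # each bin adds the same amount
    return total

print(expected_damage(5))   # roughly 5.0
print(expected_damage(10))  # roughly 10.0 -- no convergence in sight
```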
Something that's not really a big problem, but worth thinking about: some of these eventualities may very well cause you damage but are beyond your sphere of influence. Sure, you should try to avoid going bankrupt if someone knocks over a server rack, but if all Google data centres go down over an entire continent, you've got bigger fish to fry. So focusing on the things you can do something about is a helpful way to stay focused.
"We are spending $50 per month just for one test in our code. We could cut it down to $10 if we wanted."
"How many hours would it take to reduce the spend? If it's more than a couple of hours for a senior engineer, then it's not worth it."
We kept spending money on this inefficient test and it was the right choice.
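That conversation is just a break-even calculation. A quick sketch, with a made-up hourly rate (plug in your own numbers):

```python
# Back-of-the-envelope version of the "is it worth optimizing?"
# question. The $150/hour senior-engineer rate is an assumption
# for illustration, not a figure from the conversation above.

def breakeven_hours(current_monthly: float, reduced_monthly: float,
                    hourly_rate: float, horizon_months: int = 12) -> float:
    """Engineer-hours the optimization may cost before it stops
    paying for itself over the given horizon."""
    savings = (current_monthly - reduced_monthly) * horizon_months
    return savings / hourly_rate

# $50/month -> $10/month, assumed $150/hour, looking one year ahead:
print(breakeven_hours(50, 10, 150))  # 3.2 hours
```

If the fix takes longer than a few hours, the inefficient test wins, exactly as in the story.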
I had a similar conversation as a new-ish fractional CTO last year. One team was working on a new CRM product that was effectively alpha-level software used only internally. The team had become terrified of shipping and breaking something and was horrifically risk averse. For a new release that the team was going to delay again at the last minute, I got the CEO on the release call and asked him what would happen if the release completely failed and it took us an entire day to get the product working again. He replied "Not a big deal. The users would just write stuff down like they do today and key it in tomorrow. It's not like this has enough features to be critical or anything".
The team was completely stunned. It goes without saying we did the release, found a small mistake, fixed it, and life went on.
Teams really do have to understand who their users are and the criticality of the software.
We were working towards a business continuity plan which can include incidents like your main office and operations being destroyed and having to quickly relocate all services to 3rd-parties using off-site backups with minimal staff. While that was a worst-case, a primary focus was just getting a notification site up and running in the event of a network outage because that was vastly more frequent and had high visibility.
It was a very interesting project and I learned quite a bit about how to think comprehensively about the solutions we provided.
The biggest practical impediment to increasing velocity of delivery that I encounter is trying to convey this. People can visualize and estimate the risk and impact of a deployment gone wrong, but have a hard time estimating the impact of processes that slow down delivery. Therefore they overindex on heavy and "safe" processes (which often don't increase safety) at the cost of speed of iteration.
I'm not sure how to define this asymmetry, maybe some variation of loss aversion.
Sometimes not even that. We have seen many huge breakdowns in recent years that did hit bottom lines but didn't impact stock prices. Perhaps, at least for a publicly-traded company, the only real risks are those that might impact stock prices. That might include things that hurt other companies, if doing so might result in money leaving the entire sector.
For example, it's way easier/faster to implement observability and some sort of rollback of bad versions than to try and prevent every possible way an app could crash and trigger a bunch of problems. What's going to happen if the app crash is pretty simple : customers will be mad (CS/Marketing/PR can handle them), you'll notice the downtime quickly and rollback (or maybe even rollback automatically!). Then you'll be in a perfect position to handle what went wrong : systems will be back on a known stable position and all the stress of trying to fix something in a live production system will be gone.
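The "notice the downtime quickly and roll back automatically" part can be as simple as watching the error rate after a deploy. A minimal sketch, with arbitrary threshold and sample-window assumptions:

```python
# Minimal sketch of an automatic-rollback decision: look at recent
# requests after a deploy and revert if too many failed. The 10%
# threshold and 20-sample minimum are arbitrary assumptions.

def should_rollback(recent_requests: list[bool],
                    error_threshold: float = 0.1,
                    min_samples: int = 20) -> bool:
    """recent_requests: True = request failed, False = request OK."""
    if len(recent_requests) < min_samples:
        return False  # not enough data to judge the new version yet
    error_rate = sum(recent_requests) / len(recent_requests)
    return error_rate > error_threshold

healthy = [False] * 95 + [True] * 5   # 5% errors: leave it alone
broken = [False] * 50 + [True] * 50   # 50% errors: roll back
print(should_rollback(healthy))  # False
print(should_rollback(broken))   # True
```

Real systems would compare against the pre-deploy baseline rather than a fixed threshold, but the shape of the logic is the same.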
I think it's the definition of the black swan theory[1].
And one of those standards (and no, I don't give a shit about developer experience, software or otherwise) should be that you never, ever test on production. As soon as you work on real products for real customers, you'd better start behaving like a professional. Child's play is over as soon as someone is paying you to do stuff.
Not being confident in your test plan is a sign of immaturity, not maturity, because at some point you are going to need to validate how something behaves in production.
There is a wide range of processes, procedures, and software architectures to get you confident that your production testing is doing more good than harm for your customers, but in any environment where you can deploy new software, you are going to do some testing in production.
I've heard "feature flags" are popular these days, and I understand that that's where you commit code for a new way of doing things but hide it behind a flag so you don't have to turn it on right away.
Now, if I want to test in prod, couldn't I just make the flag for my new feature turn on if I log in on a special developer test account? And if everything goes well, I change the condition to apply to everyone?
As long as your code makes sure it takes account of that flag everywhere that it is used. Otherwise your new feature could "leak" into the system for everyone else.
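The per-account gating described above is only a few lines. A rough sketch, where all the names are invented for illustration (real systems usually use a feature-flag service rather than an in-process set):

```python
# Rough sketch of "flag on for a special developer account first".
# Account names, flag names, and the in-memory sets are all made up;
# a real setup would query a flag service or config store.

DEV_TEST_ACCOUNTS = {"dev-test@example.com"}
FULLY_ENABLED_FLAGS = set()  # flags rolled out to everyone

def is_enabled(flag: str, user_email: str) -> bool:
    if flag in FULLY_ENABLED_FLAGS:
        return True
    # During testing, only the special developer account sees it:
    return user_email in DEV_TEST_ACCOUNTS

print(is_enabled("new-checkout", "dev-test@example.com"))  # True
print(is_enabled("new-checkout", "customer@example.com"))  # False

# "If everything goes well, change the condition to apply to everyone":
FULLY_ENABLED_FLAGS.add("new-checkout")
print(is_enabled("new-checkout", "customer@example.com"))  # True
```

The "leak" warning from the comment above is exactly the case where some code path forgets to call `is_enabled` before using the new behavior.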
Plus, as systems grow in complexity, there's always a danger that features step on each other. We'd like to think that everything we write is nicely isolated and separated from the rest of the system, but it never works that way - plus we're just a group of squishy humans who make mistakes. There will be times when having Features A and C switched on, with B switched off, produces some weird interactions that don't happen if A, B and C are switched on together.
There ends up being code to deal with what happens when various combinations of flags are on/off, and that code doesn’t get tested much.
And teams spend a lot of time just removing flags.
This isn’t a safety-critical app - I really think they’d do better dropping the flags, and just deploying what they want when it’s ready.
You not only waste time with "Remove feature flag X" stories if all customers end up with the feature, you also slow down the response time of some categories of bugs, because you end up having to stop and check the combination of feature flags to reproduce a bug.
And if you end up with a feature that isn't popular except by one customer, not only are you now stuck supporting "Legacy feature Y", you're actually stuck supporting, "Optional legacy feature Y" which is worse.
Maybe I'm ranting about "misuse of feature flags", but I prefer not to pontificate about how things ought to be, only about how in my experience they actually are.
In my experience, feature flags work best if you aim to remove them as quickly as possible. They can be useful to allow continual deployment, and even for limited beta programs, but if you're using them to enable mature features for the whole customer base, they're no longer feature flags.
Definitely doesn't do anything like completely obviate the issue though.
But there are numerous ways to use feature flags incorrectly - typically once you have multiple long-lived flags that interact with each other, you've lost the thread. You no longer have one single application, you have 2 ^ n_flags applications that all behave in subtly different ways depending on the interaction of the flags.
There's no way around it - you have to test all branches of your code somehow. "Just let the users find the bugs" doesn't work in this case since each user can only test their unique combination of flags. I've regularly seen default and QA tester flag configurations work great, only to have a particular combination fail for customers.
The only solution is setting up a full integration test for every combination of flags. If that sounds tedious (and it is), the solution is to avoid feature flags, not to avoid testing them!
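The combinatorial cost is easy to see in numbers: exhaustively covering every on/off combination of n independent flags means 2**n runs. A sketch of generating those configurations (the flag names are made up, and the loop body is a stand-in for a real integration suite):

```python
# Enumerate every on/off combination of a set of feature flags.
# Flag names are invented; the loop body is a placeholder for
# running an integration suite against each configuration.

from itertools import product

def all_flag_combos(flags: list[str]):
    for values in product([False, True], repeat=len(flags)):
        yield dict(zip(flags, values))

flags = ["new_checkout", "dark_mode", "fast_search"]
combos = list(all_flag_combos(flags))
print(len(combos))  # 8 == 2**3

for combo in combos:
    pass  # run the integration suite with this flag configuration
```

Three flags is still tractable; ten flags is 1024 runs, which is why long-lived interacting flags get so expensive to test honestly.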
I've long been wondering whether there are tools that help with that. Something like measuring a test suite's code coverage, but for feature-toggle permutations: either you test those permutations explicitly or you rule them out explicitly.
It can also be a huge PITA. The fallacy is that a "feature" is an isolated chunk of code: you just wrap that in a thing that says "if feature is on, do the code!". But in reality, a single feature often touches numerous different code points, potentially across multiple codebases and services/APIs. So you have to intertwine that feature flag all over the place. Then write tests for each scenario (do the right thing when the feature is off, do the right thing when the feature is on). Then you have to remember to go back and clean up all that code once the feature is on for everyone and stabilized.
It's a good tool, but it's not an easy tool like a lot of folks think it is.
For example maybe the feature flag just shows/hides a new button on the UI. The rest of the code like the new backend endpoint and the new database column are "live" (not behind any flags) and just invisible to a regular user since they will never hit that code without the button.
As far as "remembering" to clean up the feature flag, teams I've been on have added a ticket for cleaning up the feature flag(s) as part of the project, so this work doesn't get lost in the shuffle. (And also to make visible to Product and other teams that there is some work there to clean up)
For example, the Microsoft Azure public cloud has a hierarchy of tenant -> subscription -> resource group -> resource.
It's possible to have feature flags at all four levels, but the most common one I see is rolling deployments where they pick customer subscriptions at random, and deploy to those in batches.
This means you can have a scenario where your tenant (company) is only partially enabled for a feature, with some departments having subscriptions with the feature on, but others don't have it yet.
This can be both good and bad. The blast radius of a bad update is minimised, but the users affected don't care how many other users aren't affected! Similarly, inconsistencies like the one above are frustrating. Even simple things like demonstrating a feature for someone else can result in accidental gaslighting where you swear up and down that they just need to "click here" and they can't find the button...
Not to mention it looks really awkward when an account manager has forgotten to enable some great new feature for you.
I'm not a fan of this article in general; a lot of what it talks about is an anti-pattern in my book. Take the bit about microservices as an example. They are excellent in small teams, even when you only have 2-5 developers. The author isn't wrong as such; it's just that the author seems to misunderstand why Conway's law points toward service architectures. Because even when you have 2-5 developers, the teams that actually "own" the various things you build in your organisation might make up hundreds of people. In which case you're still going to avoid a lot of complexity by using a service architecture, even if your developers sort of work on the same things.
Classic story: https://dougseven.com/2014/04/17/knightmare-a-devops-caution...
> Unfortunately there is no easy way to distinguish between people who are good and need a paycheck from people who just need a paycheck. But you sure as hell don’t want the latter in your team.
If you can't tell them apart, then the distinction is unimportant. So if among the group of people who need paychecks, good is indistinguishable from non-good, the comment serves no purpose other than needless elitism.
> If GitHub makes a mistake it can affect thousands of businesses but they’ll likely shrug and their DevOps team will just post “GitHub is down, nothing we can do” on some Slack channel.
Gonna try and read the rest of this on the lunch break, as it was surprisingly meaty for a clickbait title ;)
> That’s a terrible mistake and in the long run will be the cause of cost overruns, unmet deadlines, increased churn and overall bad vibes. And nobody wants bad vibes.
Of course, there are always exceptions to this rule. Adapt and modify the code as needed.
We keep three environments at work: Dev, Test and Prod. However, dev environments are sometimes neglected and some features land in Test only.
So, use Dev as a development playground. Use Test to test the changes made in Dev. If the change is approved in Test, it will go in Prod environment.
In this case, a good "Testing on Production" rule would be to not let customers test your software, period.
There's plenty of land and resources to construct towns and cities that simulate real-life commute very accurately.
In the case of self-driving (or even autopilot), you're not really testing a feature, you're researching a new product; the difference is vast.
A bug which must be fixed in production is much more expensive than a bug fixed during development.
People here complain when you bash Microsoft, but their philosophy was (and still is) to let the users test the product.
> Ask yourself a question: do you have any reason to think that your engineers will not do a good job? If the answer is no: why are they still there? If the answer is yes: let them do their damn job.
So well put. Just today I implemented a feature and kept asking myself whether I should extend the component (leaning more towards OOP) or just add an additional argument to it. The latter would have stuck more with the current style, but I also realized there's no obviously better way; extending made sense. Understanding the nuance and standing up for those design decisions is what I am here to do :)
Thanks for putting that in fewer words.
One of the things I've realized is that in most unregulated companies (read: non-healthcare/financial) the business side of the house is used to having little or no lower-environment lifecycle.
If they want to make a process change, they make it on production work.
Granted, they have change control approvals, etc. etc., but the whole dev-test-prod cycle looks extremely different for them, because you can't do certain things without lower environments.
I worked at a home remodeling company. Revenue was several million dollars a day. App handled sales, scheduling, logistics, everything. Breaking production was a big deal, it cost us millions per day and created logjams.
I would think that most online applications are the same. Even if a simple online web shop goes down you are costing money.
What kinds of experiences have you had where testing in production was the norm?
> because you can't do certain things without lower environments.
I agree that this is something many shops REALLY struggle with. One of the most challenging things is exporting or creating some kind of realistic data set for local development use. I think 99% of companies struggle with this.
Even at startups, the added initial costs yield more long term benefits with higher-quality products.
(Don't take it too seriously, like I said this is mostly a brain dump, I'm sure there's a lot of stuff that can be improved)
I see this challenge a lot in the industry. The young engineers truly are smart, even brilliant, but lack wisdom and experience.
Words change, they always have, they always will. Get over it.
And anyway, the article's usage is consistent with the well-established phrase "smart guy", within which the word "smart" carries a sarcastic and derisive tone.
While this is true, I think it is helpful to communication to resist changes to language. This isn't the same thing as opposing change entirely, but language needs to have a certain stability and common understanding to maximize its usefulness.
@dang any chance you could help here? :(