But does it really matter?
I read people reacting strongly to these outages, suggesting that due diligence wasn't done in choosing a 3rd party for this or that, or that a system engineered to reach anything less than 100% uptime is professional negligence.
However, off the top of my head we've had AWS outages, Gmail outages, Azure outages, DNS outages, GitHub outages, whatever else. All these hugely profitable companies are messing this stuff up constantly. Why are any of us going to do any better, and why does a few hours of downtime ultimately matter?
I think it's partly down to living somewhere where a volcano on the next island over can shut down connections to the outside world for almost a week. Life doesn't have an SLA; systems should aim for reasonable uptime, but at the end of the day the systems come back online at some point and we all move on. Just catch up on emails or something. I dislike the culture of demanding hyper-perfection, and the expectation that we should be prepared to work unhealthy shift patterns to avoid a moment of downtime in UTC-11 or something.
My view is increasingly that these outages are healthy, since they force us to confront the fallibility of the systems we build and accept that chaos wins out in the end, even if just for a few hours.
For example, I'm building a note-taking / knowledge base platform, and we were having some reliability issues last year when our platform and devops process were still a bit nascent. We had a user who was (predictably) using our platform to take notes and study for an exam, which was open book. On the day of her exam our servers went down, and she was justifiably anxious that things wouldn't be back before her exam started. Luckily I was able to stabilize everything in time and her exam went great in the end, but it might not have happened that way.
Of course most on HN would probably point out that this is obviously why your personal notes should always be hosted / backed up locally, but I took this as a personal mission to improve our reliability so that our users never have to deal with this again. And since then I'm proud to say we've maintained 99.99% uptime[1]. So yes, there are definitely many situations where we can and should take a more laid-back approach, but sometimes there are deadlines outside of your control, and having a critical piece of software go offline exactly when you need it can be a terrible experience.
And they would be right. Having your notes pushed up to the cloud is great and I use a feature like that all the time (specifically with iCloud and either the Notes app or beorg), but the most recent version of these documents should always be available offline.
Is your application unavailable without a network connection? What if you go somewhere without reception?
This is a great line of thought, and I'd encourage everyone to take it. There's a huge amount of crap people get up to that is mostly about performative debt balancing - people feel that they're owed something just because <fill in the blank>, when it really doesn't matter. Just another gross aspect of a culture overly reliant on litigation for conflict management.
But the question is meaningless without qualifying: for whom?
Because I can absolutely imagine situations where an Auth0 outage could be extremely damaging, expensive, or both. Same for a lot of other services.
> Life doesn't have an SLA
Nope. Which is part of the reason why people spend money on SLAs for certain specific things. It is just another form of insurance against risk.
The lesson is more that everything fails all of the time, and the more interconnected and dependent we make things, the more they fail. That is not something that can be solved with another SaaS, as multiple downtimes, hacks, leaks and shutdowns have shown time and time again.
My reaction was more against the performative "haha, foolish n00b developers didn't build their system to use both Lambdas and Google Cloud and then fail over to a data center on the North Pole like me, the superior genius that I am" that oftentimes appears in threads about downtime.
We could all do with a bit more "there but for the grace of god" attitude during these incidents while still learning lessons from them.
And more importantly, if YOU try to use something "not big" and it goes down, it's on YOU - but if you're using Azure and it goes down, it's "what happens".
For context, I used to live in the UK, which is probably, outside of South East Asia, one of the most "online" societies (and miles ahead of the US in terms of things like online payments processing). I never carried cash, made online orders for everything, etc.
I moved to Barbados towards the end of last year, and let's just say there's a lot of low-hanging fruit for software systems here. It takes about 4 months to get post from the UK, and you can't really get anything from Amazon. There's a single cash machine that takes my card, and sometimes it's out of money or broken. You can't open a bank account without getting a letter from your bank in the UK, with the aforementioned 4-month delay. Online banking doesn't exist. There was maybe one Deliveroo-type service, which was actually a front for credit card scamming, and maybe one other food delivery app.
In a sense it has been so much more pleasant than life in the UK, and not just because of the cheap beer and sunshine. If I have a problem I know my neighbours well enough to speak to them. I know the people in the bar, and I know who can help me out if I run out of money or need food to tide me over.
This is all a bit 'trope of the noble savage', as if life was better before all that technology or something. I don't believe that's the case; however, I do believe that over-reliance on, and belief in, always-up systems reduces societal resilience. Certain things have to work: you have to be able to phone an ambulance and have it come (or alternatively know someone who could drive you to the hospital in a pinch), and food has to get shipped in at some point, since a diet of cane sugar alone won't be sufficient. For that supply chain, technology etc. is important. But there are many other types of software regarded as "vital" that I don't think are; the criteria for what is vital are actually a lot stricter than they can feel, and there's a lot more room for delay than we'd maybe admit when caught up in the tech bubble.
Instead of being able to make a judgement call or respond appropriately to changing circumstances, and instead of being relied upon for your ability to judge the needs of your students accurately, you risk being flagged for not sticking to protocol in matters of ~student~ consumer interaction.
If a cheater slips through, does it matter that much, if the cheating amounts to getting an extra few days to complete an assignment?
Aren't universities meant to be about expanding knowledge, places of learning? Aren't we making a mockery of the whole idea of tertiary education by getting so caught up in catching 'consumers' gaming the system, and in the risk of debasing 'consumer currency points' (exam scores), in order to justify the busywork of admin departments? Software and software-enabled culture is incredibly powerful, but it also removes human factors and discretion, and has made many things worse.
Pre-COVID, schools shut down for other reasons, like snow days. Doesn't seem much different.
> However, off the top of my head we've had AWS outages, Gmail outages, Azure outages, DNS outages, GitHub outages, whatever else. All these hugely profitable companies are messing this stuff up constantly. Why are any of us going to do any better, and why does a few hours of downtime ultimately matter?
I've been mulling this for a while too, and I think I might have some responses that address your thought somewhat:
- Amazon/Google/Microsoft/etc. services have huge blast radii. If you build your own system independently, then of course you probably won't achieve as high an SLA, but from the standpoint of users, they (usually) still have alternative, independent services they can use simultaneously. That decoupling can drastically reduce the negative impact on users, even if the individual uptimes are far worse than the global one (see the back-of-the-envelope sketch after this list).
- Sometimes it turns out problems were preventable, and only occurred because someone deliberately decided to bypass some procedures. These are always irritating regardless of the fact that nobody can reach 100% uptime. And I think sometimes people get annoyed because they feel there's a non-negligible chance this was the cause, rather than (say) a volcano.
- People really hate it when the big guys go down, too.
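To put rough numbers on the decoupling point: if users can genuinely fall back between two independent services, the chance both are down at once is the product of their individual downtimes. A back-of-the-envelope sketch in Python, with purely illustrative figures:

    # Two independent services, each a mediocre 99% available on its own.
    a_down = 0.01
    b_down = 0.01
    both_down = a_down * b_down  # assumes the failures are uncorrelated
    print(f"both down: {both_down:.2%} of the time")  # 0.01%, i.e. 99.99% combined

The caveat is that "independent" is doing a lot of work there: shared dependencies (DNS, certificates, a common cloud region) correlate the failures and wreck the multiplication.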
This isn't a space for hobbyists or people just doing things in a decentralized manner anymore. The joke from the British TV sitcom 'The IT Crowd', where (the bigwigs are sold the lie that) the internet is a blinking black box in the company offices, is basically true now. Something goes wrong with some obscure autoscaling code and, actually, the little black box did break the entire internet.
I'm the kind of person who hates AWS and wants to live in the woods eating squirrels, but I can't really begrudge them downtime.
In our case (Azure downtime), it matters because none of our customer systems would work.
That includes people on the road who need to do something every 5 minutes on their PDAs (sometimes 100 people simultaneously in a big city).
So yes, it matters.
I happened to work on designing critical infrastructure for emergency services. We always planned for failure, which is why part of our deliverable was a protocol for paper logging of the calls (ambulance, police, military...) and the subsequent tracking of each case. It worked amazingly well when the system did go down. In part because it had been roleplayed, and in part because the system went down at a rather convenient time. The data was then added to the digital logs, and all was well in the world, including the people saved by the, and I kid you not, pen, and, paper... and other humans gasp
I'm just hoping the people building the ambulance dispatch networks aren't using Azure :laughing:.
The answer is surprisingly simple.
Most outages are the unintended result of someone doing something. When you are doing things yourself, you schedule the “doing something” for times when an outage would matter least.
If you are the kind of place where there is no such time, you mitigate. Backup systems, designing for resiliency, hiring someone else, etc.
I think an important consideration here is that a huge amount of time, money, and resources is spent on making sure the computers stay powered and cooled in all manner of situations. We contract redundant diesel delivery for generators, we buy and install gigantic diesel generator systems which are used for just minutes per year, huge automatic grid transfer switches, redundant fiber optic loops, dynamic routing protocols, N+1 this and double-redundant that. It's tremendously expensive in terms of money, human time, and physical/natural resources.
The point is that we are always striving to plan for failures and engineer them out. When there is an actual real-life outage, it necessarily means, given the huge amount of time and money and resources invested in planning for disaster/failure resilience, that the plan has a bug or an error.
Somebody had a responsibility (be it planning, engineering, or otherwise) that was not appropriately fulfilled.
Sure, they'll find it, and update their plan, and be able to respond better in the future - but the fundamental idea is that millions (billions?) have been spent in advance to prevent this from happening. That's not nothing.
I'd also highlight that when the big players go down, people 'know' it's not your fault; when a small 3rd-party provider goes down, taking part of your service with it, it's 'because you didn't do due diligence' or were trying to save a buck. Similar in a way to the old adage 'no one ever got fired for buying IBM'.
I think people know this implicitly, but it's good to think about it explicitly. Whether downtime matters, and how much of it is acceptable, should be a question every system has an answer to. Because ultimately uptime costs money, and many who are complaining about this outage are likely not paying anywhere near what it would cost to truly deliver 5+ nines or Space Shuttle-level code quality.
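For a feel of what each extra nine actually buys (and what you'd be paying for), a quick back-of-the-envelope in Python:

    # Allowed downtime per year for common availability targets.
    for target in (0.99, 0.999, 0.9999, 0.99999):
        minutes = (1 - target) * 365 * 24 * 60
        print(f"{target:.3%} uptime -> ~{minutes:,.0f} min/year allowed downtime")

That's roughly 3.7 days a year at two nines, under an hour at four, and about 5 minutes at five; each step typically costs far more than the last.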
Yes, people should relax a bit, but those incidents you cite did cost those companies customers. That's okay for Amazon. But a small B2B service provider can't as easily absorb the loss.
Hard to do when you can't authenticate to the email webapp.
And then it /does/ and all of us lose our shit haha.
It's similar to ubiquitous next-day delivery conditioning people to find anything longer unacceptable, when cheap next-day delivery is quite new and not even the norm yet.
You're completely right that 100% availability is unreasonable and, oftentimes, not even required, despite what a customer or site operator may believe.
Just a quick aside: availability (can an end user reach your thing) is often confused with uptime (is your thing up). If I operate a load balancer that your service sits behind and my load balancer dies, your service is up, but not available to those on the other side of said load balancer.
With that in mind, Hacker News could theoretically be up 100% of the time, but if I go through a tunnel while scrolling Hacker News on my phone, then from my perspective as a user it is no longer 100% available; it is 100% minus (the period I was without signal) available.
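To put illustrative numbers on the tunnel example:

    # User-perceived availability over a 30-day month.
    total_minutes = 30 * 24 * 60      # 43,200
    no_signal_minutes = 20            # say, time spent in the tunnel
    availability = 1 - no_signal_minutes / total_minutes
    print(f"{availability:.3%}")      # ~99.954%, though the site was 'up' 100%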
The point here is that a whole host of unreliable things happen in everyday life, from your router playing up to sharks biting the undersea cables.
With that in mind, you then want to go and figure out a reasonable level of service to provide to your end users (ask for their input!) that reflects reality.
It's worth noting too that Google (I don't love 'em, but they pioneered the field) will actually intentionally disrupt services if they're "too available", so as to keep those downstream on their toes. It's not actually good for anyone if you have 100% availability, in that downstream consumers start making too many assumptions; and also, it's just good practice, I suppose.
I can recommend reading the SLOs portion of the Google SRE book if you're curious to see more: https://sre.google/sre-book/service-level-objectives/
In short, an SLO is just an SLA without the legal part: a guarantee of a certain level of service, often made internally from one team to another.
Ideally these objectives reflect the level of service your customers (internal or external) expect from your service.
> Chubby [Bur06] is Google’s lock service for loosely coupled distributed systems. In the global case, we distribute Chubby instances such that each replica is in a different geographical region.
> Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security because the services could not function appropriately when Chubby was unavailable, however rarely that occurred.
> The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system.
> In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.
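The mechanism behind that is the error budget: the gap between 100% and the SLO is downtime you're allowed, and in Chubby's case expected, to spend. A minimal sketch, assuming a hypothetical 99.95% quarterly target:

    # Error-budget arithmetic for a hypothetical 99.95% quarterly SLO.
    slo = 0.9995
    quarter_minutes = 91 * 24 * 60            # roughly one quarter
    budget = (1 - slo) * quarter_minutes      # ~65.5 minutes of tolerable downtime
    real_outage_minutes = 12                  # say, what actually happened
    remaining = budget - real_outage_minutes
    print(f"budget {budget:.1f} min, {remaining:.1f} min left to burn")
    # If a lot is left near quarter end, SRE synthesizes a controlled
    # outage to consume it, per the Chubby passage quoted above.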
Last time it was due to several factors, but it started with silently losing some indexes during a migration. I'm very curious what happened this time -- we'll definitely do a followup episode if they publish a postmortem.
It's a very handy way to keep up with websites without having facebook/twitter/whatever in the middle.
I had to go look up the RSS feed from the HTML source code...
edit: aaand the rss is empty.
[1] https://github.blog/2018-10-21-october21-incident-report/
[2] https://github.blog/2018-10-30-oct21-post-incident-analysis/
If it's data security or something else that's your concern, you can host the data in your own database with their enterprise package.
General disclaimer: I'm a paying Auth0 customer but just use it for authentication, and it saved me a hundred hours of work for a pretty reasonable price.
Only on HN will you be told "you're an idiot if you outsource your auth" and "you're an idiot if you roll your own auth" by the same group of people.
It's something easy to get wrong, and it has a long tail of work which is extremely generic (supporting all the different social logins, two-factor authentication, password reset emails, email verification, SMS phone number verification, rate limiting, etc...)
Incentives are perfectly aligned there, and if anyone can keep a system running and secure (to everyone except the US military which can compel them), it's them.
I didn't spend a lot of time on it but initially figured it would be easy because they had what seemed to be a well-written and comprehensive blog post[1] on the topic, as well as a native plugin.
But I found a few small discrepancies between the blog post and the current state of the plugin (perhaps not too surprising; the blog post is 2 years old now and no doubt the plugin has gone through several updates).
I found the Auth0 control panel overwhelming at a glance and didn't want to spend the time to figure it all out. Basically, laziness won here, but I feel like they missed an opportunity to get a customer if they'd managed to make this much lower effort.
I moved on to something else (had much better luck with OneLogin out of the box!), but then got six separate emails over the next couple weeks from a sales rep asking if I had any questions.
I'm sure it's a neat piece of kit in the right hands or with a little more elbow grease but I was a bit disappointed with how much effort it was to get up and running for [what I thought was] a pretty basic use case.
For the password use case, it seems nice that you don't have to store credentials (e.g. salted password hashes) on your own infra. However, now, instead of authentication happening between your own servers and the user's browser, there is an additional hop to the SaaS, and you need to learn about JWTs etc. At my previous company, moving a Django monolith to authenticate via Auth0 was a multi-month project and a multi-thousand-line increase in code/complexity. And we weren't storing passwords to begin with, because we were using one-time login email links.
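For what it's worth, the "additional hop" largely boils down to validating the SaaS-issued token on every request. A minimal sketch using PyJWT, assuming RS256 tokens and a published JWKS endpoint; the tenant URL and audience below are hypothetical placeholders:

    import jwt  # PyJWT, installed with the 'cryptography' extra

    JWKS_URL = "https://YOUR_TENANT.auth0.com/.well-known/jwks.json"  # hypothetical tenant
    AUDIENCE = "https://api.example.com"                              # hypothetical API id

    def verify_token(token: str) -> dict:
        # Fetch the public signing key matching the token's 'kid' header.
        signing_key = jwt.PyJWKClient(JWKS_URL).get_signing_key_from_jwt(token)
        # Verifies signature, expiry and audience; raises jwt.PyJWTError on failure.
        return jwt.decode(token, signing_key.key, algorithms=["RS256"], audience=AUDIENCE)

None of which is hard on its own, but multiplied across middleware, sessions, tests and edge cases, it's easy to see how it becomes a multi-month migration.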
Maybe SaaS platforms are worth it for social login? I haven't tried that, but I am not convinced that Auth0 or someone else can help me connect with facebook/twitter/google better than a library can.
I just can't even imagine why you would these days, there are even "local" options that act as "local 3rd party auth providers".
Screw losing sleep over whether you're storing credentials correctly.
We've looked at Auth0 and Okta because we wanted to see if we could save some dev time devising RBAC and supporting a lot of different auth integrations. We ended up doing it in-house, since the quote was unacceptable (essentially a mid-level dev salary per year).
However, I came across this specific need of implementing both the authorization server and the resource server in the same application. For that I was planning to implement the authorization server using Spring, but I came to know that Spring has stopped active development of its OAuth project, so I'm planning to use Keycloak instead; I'm also planning to store the client id & client secret in a MySQL database.
In the authorization server I have to generate an access token, send it back to the client, and then verify the same token when an API call is made with it.
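One small, hedged sketch on the storage side: if your own code ends up holding client secrets in MySQL, it's worth hashing them like passwords rather than storing them raw. Stdlib-only Python, names hypothetical:

    import hashlib, hmac, secrets

    def hash_secret(client_secret: str, salt: bytes) -> bytes:
        # Slow KDF so a leaked table doesn't yield usable secrets.
        return hashlib.pbkdf2_hmac("sha256", client_secret.encode(), salt, 600_000)

    def verify_secret(candidate: str, salt: bytes, stored_hash: bytes) -> bool:
        # Constant-time comparison to avoid timing side channels.
        return hmac.compare_digest(hash_secret(candidate, salt), stored_hash)

    salt = secrets.token_bytes(16)              # persist next to the hash
    stored = hash_secret("client-secret-here", salt)
    assert verify_secret("client-secret-here", salt, stored)

(If Keycloak is the authorization server, it manages client credentials itself; the sketch only applies where your application is the one persisting them.)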
If you don't mind do you have any link or specific resources for the development which you did? I would love to see your project as well. Thanks.
TL;DR: the feature flag service was to blame
Also, security practices are supposedly better and more robust there than at your average place.
I think those two things are the value adds.