Tell HN: AWS appears to be down again

863 points by riknox 4y ago | 614 comments

Console is flickering between "website is unavailable" and being up for my team. This is happening very frequently just now; reliability seems to have taken a hit.

aledalgrande 4y ago

If you haven't seen yet, news is it was a power loss:

> 5:01 AM PST We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.

vinay_ys 4y ago

This is quite interesting as they claim their datacenter design does better than Uptime's Tier3+ design requirements which require redundant power supply paths. [https://aws.amazon.com/compliance/uptimeinstitute/]. I really hope they publish a thorough RCA for this incident.

tyingq 4y ago

"Electrical power systems are designed to be fully redundant so that in the event of a disruption, uninterruptible power supply units can be engaged for certain functions, while generators can provide backup power for the entire facility." https://aws.amazon.com/compliance/data-center/infrastructure...

So they have 2 different sources of power coming in. And generators. They do mention the UPS is only for "certain functions", so I guess it's not enough to handle full load while generators spin up if the 2 primaries go out. Or perhaps some failure in the source switching equipment (typically called a "static transfer switch").

Some detail on different approaches: https://www.donwil.com/wp-content/uploads/white-papers/Using...

vinay_ys 4y ago

Usually when someone claims T3+ they mean they have UPS clusters in 3+1 (or such) configuration and two different such UPS clusters power two power-strips in a rack. Then they would also have incoming grid power supply from two different HV sub-stations with non-intersecting cable paths. They would also have diesel power generators in 3+1 or 5+2 configurations with automatic startup time in seconds. The UPS's power storage (chemical or potential energy based devices) can hold enough energy to handle full load for several minutes. If these are designed and maintained correctly, even while concurrent scheduled maintenance is ongoing, an unexpected component failure should not cause a catastrophic outage. At each layer (grid incomers, generator incomers, UPS power incomers) there are switches to switch over whenever there's a need (maintenance or failure).

If they claim tier4, then they basically have everything in n+n configuration.
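
To make the N+k notation above concrete, here is a minimal sketch (hypothetical names, identical units assumed, not any vendor's sizing method): a group needs `needed` units to carry the full load and has `spares` extra, so it can lose up to `spares` units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyGroup:
    """Hypothetical model of an N+k group (UPS cluster, generator set)."""
    needed: int   # N: units required to carry the full load
    spares: int   # k: extra units beyond N

    def survives(self, failed_units: int) -> bool:
        """True if the remaining units can still carry the full load."""
        remaining = self.needed + self.spares - failed_units
        return remaining >= self.needed


ups_cluster = RedundancyGroup(needed=3, spares=1)    # "3+1"
generators = RedundancyGroup(needed=5, spares=2)     # "5+2"
tier4_style = RedundancyGroup(needed=4, spares=4)    # "n+n", with n = 4 assumed

print(ups_cluster.survives(1))   # one unit out for maintenance
print(ups_cluster.survives(2))   # maintenance plus an unexpected failure
print(generators.survives(2))
print(tier4_style.survives(4))
```

Note that a single 3+1 cluster does not survive two simultaneous losses; in the design described above that case is covered by the second UPS cluster feeding the rack's other power strip.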

dylan604 4y ago

The generators should be powering up as soon as one of the 2 different sources goes down. It takes generators a few minutes to power up and get "warmed up". If they don't start this process until both mains sources are down, then oops, there's a power outage.

I used to work next door to a "major" cable TV station's broadcast location. They had multiple generators on-site, and one of them was running 24/7 (they rotated which one was hot). A major power outage hit, and there was a thunderous roar as all of the generators fired up. The channel never went off the air.

AtlasBarfed 4y ago

Has datacenter power redundancy undergone any sort of revolution with grid storage becoming industrial scale?

I wonder if a lot of AWS DC design in this area predates the battery grid storage revolution, which (my impression is) offers a far faster adaptation/switchover time than a generator spin-up, and possibly software systems that detect failures and switch over quickly?

AWS can claim it will be best of breed, but they aren't going to throw out a DC power redundancy investment (or threaten downtime) that they can't wring more ROI on.

rainbowzootsuit 4y ago

Likely the UPS can't run HVAC, and you are in an overheat condition in about two minutes with a fully loaded data center without cooling. Proportionately longer as load is reduced.

JshWright 4y ago

> I really hope they publish a thorough RCA for this incident.

We're still waiting on the RCA for last week's us-west outage...

codeduck 4y ago

Another example of a single DC in a single AZ rendering an entire region almost unusable. This has shades of eu-central-1 all over again.

nightpool 4y ago

Amazon is claiming the failure is limited to a single AZ. Are you seeing failures for instances outside of that AZ? If not, how has this rendered "the entire region almost unusable"?

matharmin 4y ago

Yes, I've seen issues that affected the entire region. In my specific case, I happened to have an ElastiCache cluster in the affected AZ that became unreachable (my fault for single AZ). But even now, I'm unable to create any new ElastiCache clusters in different AZs (which I wanted to use for manual failover). And there were a lot of errors on the AWS console during the outage.

"almost unusable" is maybe exaggerating, but there were definitely issues affecting more than just the single AZ.

codeduck 4y ago

We've had alerts for packet loss and had issues in recovering region-spanning services (both AWS and 3rd party).

Yes, some of these we should be better at handling ourselves, but... it's all very well to say "expect to lose an AZ" but during this outage it's not been physically possible to remove the broken AZ instances from multi-AZ services because we cannot physically get them to respond to or acknowledge commands.

edit: just to short circuit any "well, why aren't you running redundant regions" - we run redundant regions at all times. But for reasons of latency, many customers will bind to their closest region, and the nature of our technology is highly location-bound. It is not possible for us to move active sessions to an alternate region. So something like this is... unpleasant.

londons_explore 4y ago

A lot of people will automatically fail over jobs to other AZ's. That often involves spinning up lots more EC2 instances and moving PB's of data. The end result is all capacity on other AZ's gets used up, and networks get full to capacity, and even if those other zones are technically working, practically they aren't really usable.

tyingq 4y ago

Perspective, I would guess. Unless you spend a lot of time on retry/timeout/fail logic around AWS APIs, your app could be stuck/blocked in the RunInstances() API, for example.
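
As a rough sketch of the retry/timeout/fail logic mentioned above (plain Python; `TransientError`, `call_with_retries` and the fake launch function are hypothetical stand-ins, not AWS SDK names): cap the attempts, back off with jitter, and give up after a deadline so the caller can fail over instead of hanging.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a throttling or timeout error from a remote API."""


def call_with_retries(operation, max_attempts=4, base_delay=0.5,
                      deadline_seconds=30.0, sleep=time.sleep):
    """Call operation() with capped attempts, jittered backoff and a deadline."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            out_of_time = time.monotonic() - start >= deadline_seconds
            if attempt == max_attempts or out_of_time:
                raise  # give up so the caller can fail over instead of hanging
            # Exponential backoff with full jitter: 0 .. base_delay * 2^(attempt-1)
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


# Usage: a fake launch call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_launch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated RunInstances error")
    return "i-hypothetical"


# The sleep function is replaced with a no-op so the demo runs instantly.
result = call_with_retries(flaky_launch, sleep=lambda seconds: None)
print(result, attempts["count"])
```

This wrapper only bounds the retries. It assumes each individual call already has its own connect/read timeout configured in the client; otherwise a single hung call can still block.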

SCdF 4y ago

So dumb question from someone who hasn't maintained large public infrastructure:

Isn't the whole point of availability zones that you deploy to more than one and support failing over if one fails?

I.e., why are we (consumers) hearing about this or being obviously impacted (e.g. Epic Games Store is very broken right now)? Is my assessment wrong, or are all these apps that are failing built wrong? Or something in between?

fulafel 4y ago

IME people rarely test and drill for the failovers; it's just a checkbox in a high level plan. Maybe they have a todo item for it somewhere but it never seems very important as AZ failures are usually quite rare. After ignoring the issue for a while it starts to seem risky to test for it: you might get an outage due to the bugs it's likely to uncover.

gpm 4y ago

> or are all these apps that are failing built wrong

Deploying to multiple places is more expensive, it's not wrong to choose not to, it's trading off reliability for cost.

It's also unclear to me how often things fail in a way that actually only affects one AZ, but I haven't seen any good statistics either way on that one.

peeters 4y ago

As I understand it for something like SQS, Lambda etc, AWS should automatically tolerate an AZ going down. They're responsible for making the service highly available. For something like EC2 though, where a customer is just running a node on AWS, there's no automatic failover. It's a lot more complicated to replicate a running, stateful virtual machine and have it seamlessly failover to a different host. So typically it's up to the developers to use EC2 in a way that makes it easy to relaunch the nodes on a different AZ.

robjan 4y ago

That's the theory but in practice very few companies bother because it's expensive, complicated and most workloads or customers can tolerate less than 100% uptime.

sprite 4y ago

I thought I was Multi AZ but something failed. I am mostly running EC2 + RDS both with 2 availability zones. I will have to dig into the problem but I think the issue is that my setup for RDS is one writer instance and one reader instance, each in a different AZ. However I guess there was nothing for it to fail over to since my other instance was the writer instance, so I guess I need to keep a 3rd instance available preferably in a 3rd AZ?

TruthWillHurt 4y ago

Amazon shifts the responsibility for multi-AZ deployment to us customers, saving themselves complexity and charging us extra - win-win for them.

_joel 4y ago

You're supposed to build your app across multiple AZs, but I know a lot of companies that don't do this and shove everything in a single AZ. It's not just about deploying an instance there but ensuring the consistency of data and state across the AZs.

xyst 4y ago

This region in general is a clusterfuck. If companies by now do not have a disaster recovery and resiliency strategy in place, you are just shooting yourself in the foot.

philsnow 4y ago

In today's world of stitching together dozens of services, who each probably do the same thing, how is one to avoid a dependency on us-east-1? Add yet another bullet to the vendor questionnaire (ugh) about whether they are singly-homed / have a failover plan?

It's turtles all the way down, and underneath all the turtles is us-east-1.

notyourday 4y ago

We are being told that there are still issues in USE1-AZ4 and some of the instances are stuck in the wrong state as of 16:15 EST. There's no ETA for resolution.

alostpuppy 4y ago

Why do folks host their stuff in us-East? Is there a draw other than organizational momentum?

dragonwriter 4y ago

> Why do folks host their stuff in us-East?

Off the top of my head, US-EAST-1 is:

(1) topologically closer to certain customers than other regions (this applies to all regions for different customers),

(2) consistently in the first set of regions to get new features,

(3) usually in the lowest price tier for features whose pricing varies by region,

(4) where certain global (notionally region agnostic) services are effectively hosted and certain interactions with them in region-specific services need to be done.

#4 is a unique feature of US-East-1, #2-#3 are factors in region selection that can also favor other regions, e.g., for users in the West US, US-West-2 beats US-West-1 on them, and is why some users topologically closer to US-West-1 favor US-West-2.

superdug 4y ago

It's the cheapest.

GrumpyNl 4y ago

How come they don't have power backups?

chkhd 4y ago

"When a fail-safe system fails, it fails by failing to fail-safe." - https://en.wikipedia.org/wiki/Systemantics

2-718-281-828 4y ago

is that just playing with words?

redm 4y ago

Some datacenter failures aren't related to redundancy. Some examples: 1) transfer switch failure where you can't switch over to backup generators and the UPS runs out, 2) someone accidentally hits the EPO (emergency power off), 3) maintenance work makes a mistake such as turning off the wrong circuits, 4) cooling doesn't switch over fully to backups and while your systems have power, it's too hot to run. The list can go on and on.

I'm not sure why this is a big deal though; this is why Amazon has multiple AZs. If you're in one AZ, you take your chances.

taf2 4y ago

it was not a total power loss. out of 40 instances we had running at the time of the incident only 5 of our instances appeared to be lost to the power outage. the bigger issue for us was ec2 api to stop/start these instances appeared to be unavailable (but probably due to the rack these instances were in having no power). The other issue that was impactful to us was that many of the remaining running instances in the zone had intermittent connectivity out to the internet. Additionally, the incident was made worse by many of our supporting vendors being impacted as well...

IMO it was handled rather well and fast by AWS... not saying we shouldn't beat them up (for a discount) but being honest this wasn't that bad.

res0nat0r 4y ago

If the rack your instances are running in is totally offline, then the EC2 API unfortunately can't talk to the dom0 and tell the instances to stop/start, so you get annoying "stuck instances", and you really can't do anything until the rack is back online and able to respond to API calls.

chousuke 4y ago

Sometimes, you have a component which fails in such a way that your redundancies can't really help.

I once had to prepare for a total blackout scenario in a datacenter because there was a fault in the power supply system that required bypassing major systems to fix. Had some mistake or fault happened during those critical moments, all power would've been lost.

Well-designed redundancy makes high-impact incidents less likely, but you're not immune to Murphy's law.

macintux 4y ago

To my mind, among the more frustrating aspects to implementing protection against failure is that the mechanisms to be added can themselves cause failure.

It's turtles all the way down.

trelane 4y ago

Anything can fail, even your backup, and especially if it's mechanical.

rdines 4y ago

The battery backups (called uninterruptible power supplies) are only meant to bridge the gap between the power going out and the generator turning on, which is a few minutes. Did they say power was the issue this time? I suspect it’s actually something else (ahem network)

Spooky23 4y ago

Their datacenter(s) aren’t magic because they are AWS. That facility is probably a decade old and like anything else as it ages the technical and maintenance debt makes management more challenging.

thetinguy 4y ago

They do. I remember watching one of their sessions where they showed every rack having its own battery backup.

tyingq 4y ago

An article on that: https://datacenterfrontier.com/aws-designs-in-rack-micro-ups...

Interesting quote:

“This is exactly the sort of design that lets me sleep like a baby,” said DeSantis. “And indeed, this new design is getting even better availability” – better than “seven nines” or 99.99999 percent uptime, DeSantis said.
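
For a sense of scale, the downtime budget implied by a given number of nines is simple arithmetic. The sketch below assumes a 365-day year and uses a hypothetical helper name; seven nines works out to roughly 3.15 seconds of downtime per year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (non-leap year)


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability percentage."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for label, percent in [("three nines", 99.9),
                       ("five nines", 99.999),
                       ("seven nines", 99.99999)]:
    print(f"{label}: {allowed_downtime_seconds(percent):,.2f} s/year")
```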

TrueDuality 4y ago

According to the SOC certifications they give their customers they do.

ItsBob 4y ago

I've built out many 42U racks in DC's in my time and there were a couple of rules that we never skipped:

1. Dual power in each server/device - One PSU was powered by one outlet, the other PSU by a different one with a different source meaning that we can lose a single power supply/circuit and nothing happens

2. Dual network (at minimum) - For the same reasons as above since the switches didn't always have dual power in them.

I've only had a DC fail once when the engineer was performing work on the power circuitry for the DC and thought he was taking down one, but was in fact the wrong one and took both power circuits down at the same time.

However, a power cut (in the traditional sense where the supplier has a failure so nothing comes in over the wire) should have literally zero effect!

What am I missing?

I've never worked anywhere with Amazon's budget so why are they not handling this? Is it more than just the incoming supply being down?

growse 4y ago

> 1. Dual power in each server/device - One PSU was powered by one outlet, the other PSU by a different one with a different source meaning that we can lose a single power supply/circuit and nothing happens

Nothing happens if you remember that your new capacity limit per DC supply is 50% of the actual limit, and you're 100% confident that either of your supplies can seamlessly handle their load suddenly increasing by 100%.

I've seen more than one failure in a DC where they wired it up as you described, had a whole power side fail, followed by the other side promptly also failing because it couldn't handle the sudden new load placed on it.
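
The arithmetic behind this failure mode, as a small sketch (hypothetical function, assuming load is spread evenly and shifts evenly to the surviving feeds; the percentages are illustrative):

```python
def load_after_failover(utilization_per_feed: float, feeds: int = 2,
                        failed: int = 1) -> float:
    """Utilization of each surviving feed once failed feeds shed their load.

    Assumes the load was spread evenly and moves evenly to the survivors.
    """
    if failed >= feeds:
        raise ValueError("no surviving feed to carry the load")
    total_load = utilization_per_feed * feeds
    return total_load / (feeds - failed)


for before in (0.48, 0.50, 0.65):
    after = load_after_failover(before)
    status = "ok" if after <= 1.0 else "overloaded, second side trips too"
    print(f"{before:.0%} per feed -> {after:.0%} on the survivor: {status}")
```

With two feeds, anything above 50% per feed in normal operation means the surviving side is asked for more than its rating the moment the other side drops.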

dijit 4y ago

EDIT: I misunderstood, you were talking about power feeds. The normal case is to run "48% as if it's 100%" (because of power spikes, but also because most types of transformers run more efficiently under specific levels of load, 40-60%).

Normally this is factored into the Rack you buy from a hardware provider, they will tell you that you have 10A or 16A on each feed, if you exceed that: it will work, but you are overloading their feed and they might complain about it.

notyourday 4y ago

> I've only had a DC fail once when the engineer was performing work on the power circuitry for the DC and thought he was taking down one, but was in fact the wrong one and took both power circuits down at the same time.

This is all local scale. Your setup would not survive a data center scale power outage. At scale power outages are datacenter scale.

Data centers lose supply lines. They lose transformers. Sometimes they lose primary feed and secondary feed at the same time. Automatic transfer switches cannot be tested periodically i.e. they are typically tested once. Testing them is not "fire up a generator and see if we can draw from it"

It is cheaper to design a system that must be up which accounts for a data center being totally down and a portion of the system being totally unavailable than to add more datacenter mitigations.

bombcar 4y ago

The datacenter we were in had dual-sourced grid power (two separate grid connections on opposite sides of the block, coming from different substations) along with a room of batteries (good for iirc 1hr total runtime for the whole datacenter, setup in quad banks, two on each "rail"), _and_ multiple independent massive diesel generators, which they ran and switched power to every month for at least an hour.

And to top it off each rack had its own smaller UPS at the bottom and top, fed off both rails, and each server was fed from both.

We never had a power issue there; in fact SDGE would ask them to throw to the generators during potential brown-out conditions.

Of course this was a datacenter that was a former General Atomics setup iirc ...

notyourday 4y ago

We were in a triple sourced data center. Fed by three different substations. Everything worked like a charm. Until Sandy hit. It did not affect us at all. But it affected the power company. And everything still worked fine, until one of the transfer switches transferred into UPS position and stopped working in that position.

ItsBob 4y ago

Yes but if you have reliable power from two different sources then the biggest risk (I'd imagine) is the failover circuitry! Something that should be tested tbh.

Also, there are banks of batteries and generators in between the power company cables and the kit: did they not kick-in?

Again, this is all pure speculation: I have absolutely no idea of the exact failure, nor how their infrastructure is held together - this is all just speculation for the hell of it :)

notyourday 4y ago

> Yes but if you have reliable power from two different sources then the biggest risk (I'd imagine) is the failover circuitry! Something that should be tested tbh.

That's ATS. It is not really advisable to test their under load performance because the failure of an ATS would be catastrophic. ATS typically would be tested at the installation and after that their parameters would be monitored.

Replacing a functional in line ATS would be a 9-12 months long project.

> Also, there are banks of batteries and generators in between the power company cables and the kit: did they not kick-in?

At high energy you are pretty much always going to use an ATS.

merlyn 4y ago

Frying hardware can affect much wider scope.

I've had bad power supplies fry out taking the whole power circuit with it, and thus half (or whatever fraction) of the rack's power. I've also had bad power supplies bring down the whole machine as they shunted everything internal too.

When things go bad, anything can happen. You can provide the best effort, and it'll usually work as expected, but there will always be something that can overcome your best efforts.

vel0city 4y ago

The only full datacenter outage I've personally experienced was a power maintenance tech testing the transfer switch between systems where the power was 90 degrees out of phase. Big oof.
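
A simplified single-phase sketch of why an out-of-phase transfer is so damaging (hypothetical helper; the 480 V figure is an assumed nominal value, not from the incident described): for two equal-amplitude sources at the same frequency, the voltage between them is 2 * V * sin(offset / 2), so at 90 degrees it is about 1.41 times nominal rather than near zero.

```python
import math


def voltage_across_switch(rms_volts: float, phase_offset_degrees: float) -> float:
    """RMS voltage between two equal-amplitude, same-frequency sources.

    Uses the phasor identity |V1 - V2| = 2 * V * sin(offset / 2).
    """
    offset = math.radians(phase_offset_degrees)
    return 2 * rms_volts * abs(math.sin(offset / 2))


nominal = 480.0  # assumed nominal voltage, for illustration only
for degrees in (0, 10, 90, 180):
    print(degrees, round(voltage_across_switch(nominal, degrees), 1))
```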

theideaofcoffee 4y ago

Transfer switches at any facility that's worth being colocated in are exercised as periodically as the generators to which they connect. In all of the facilities I have had systems in (>20MW total steady state IT load), that meant once per month at minimum to keep generators happy -and to ensure the transfer functionality works-, and more often if the local grid demands it, e.g. ComEd in Chicago, or Dominion in NoVA asking for load shedding.

ClumsyPilot 4y ago

"It is cheaper to design a system that must be up which accounts for a data center being totally down and a portion of the system being totally unavailable than to add more datacenter mitigations."

Citation needed - the same issues with testing, data races and expensive bandwidth come up.

notyourday 4y ago

At high energy the lead time for the components is measured not in days but in years.

uluyol 4y ago

Why spend the cost on dual X and Y when you can failover to another cluster?

For big DC workloads, it is usually, though not always, better to take the higher failure rate than add redundancy.

ItsBob 4y ago

Really? You'd think at Amazon's scale an additional PSU in a 1U custom-built server (I assume they're custom) would be a few tens of $ at most.

Actually, now that I type that it makes sense. Scaling a few tens of dollars to a bajillion servers on the off-chance that you get an inbound power failure (quite rare I'd reckon) might cost more than what they'd lose if it does actually fail.

So yeah, they're potentially just balancing the risk here and minimising cost on the hardware.

Edit: changed grammar a bit.

vel0city 4y ago

At big cloud provider scale like Amazon, Azure, and Google they probably aren't even running PSUs at each server, they're probably doing DC at the rack these days. No point in having a million little transformers everywhere; it is far easier for maintenance to centralize those and have multiple feeding the bus bars going to each rack.

bob1029 4y ago

> I've never worked anywhere with Amazon's budget so why are they not handling this?

Perhaps we are going to discover how AWS produces such lofty margins by way of their next RCA publication.

Bluecobra 4y ago

> What am I missing?

My guess is that they cheaped out in having redundant PSUs to get you to use multiple availability zones. (More zones = more revenue)

Even a single PSU shouldn’t be an issue if they plugged in an ATS switch though.

Godel_unicode 4y ago

Unless the ATS breaks, which happens.

lordnacho 4y ago

What about a UPS/battery thingy? That's saved me a few times, though it normally just gives enough time for a short outage. Is it uncommon in cloud infra?

vel0city 4y ago

For even regular datacenters they'll often have UPS systems the size of a small car, usually several of these, to power the entire datacenters for a few minutes to get the diesel generator started.

Hippocrates 4y ago

Every time a major cloud provider has an outage, Infra people and execs cry foul and say we need to move to <the other one>. But does anyone really have an objective measure of how clouds stack up reliability-wise? I doubt it, since outages and their effects are nuanced. The other move is that they want to go multi-cloud... But I’ve been involved in enough multi-cloud initiatives to know how much time and effort those soak up, not to mention the overhead costs of maintaining two sets of infra sub-optimally. I would say that for most businesses, these costs far exceed that occasional six-hour-long outage.

mijoharas 4y ago

I mean from the explanation[0], assuming that is correct (I don't have evidence to suggest it's false) - you don't need to be multi-cloud, and you don't even need to be multi-region. As long as you're spread out over multiple availability zones in a region you should be resilient to this failure.

Somewhat surprising to see how many things are failing though, which implies either that a lot of services aren't able to fail over to a different availability zone, or that something else is going wrong.

[0] https://news.ycombinator.com/item?id=29648992

omh2 4y ago

AWS doesn't follow their own advice about hosting multi-regional, so every time us-east-1 has significant issues pretty much every AZ and region is affected.

Specifically, large parts of the management API and the IAM service are seemingly centrally hosted in us-east-1. So-called global endpoints are also dependent on us-east-1, as are parts of AWS' internal event queues (e.g. EventBridge triggers).

If your infrastructure is static you'll largely avoid the fallout, but if you rely on API calls or dynamically created resources you can get caught in the blast regardless of region.

spmurrayzzz 4y ago

Your last comment is really important, I think. I have always petitioned for "passive over active" design in distributed cloud systems. The recent outages, and also ones from the past, demonstrate why.

The fewer API calls you need to make in-band with whatever throughput is generated via your customer demand, the better. Related to that, I have been critical of lambda/FaaS/serverless infrastructure patterns for similar reasons. Always felt like a brittle house of cards to me (N.B. I do still use aws lambda, but keep it constrained to non-critical workloads).
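
One way to read "passive over active" is static stability: keep serving the last known-good state when a control-plane call fails. A minimal sketch, where `CachedConfig` and `fetch_endpoints` are hypothetical and the fetch function stands in for whatever API call the data path would otherwise make:

```python
import time


class CachedConfig:
    """Serve the last known-good value when the control plane is unreachable."""

    def __init__(self, fetch, max_age_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch            # hypothetical control-plane call
        self._max_age = max_age_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        fresh = (self._fetched_at is not None
                 and now - self._fetched_at < self._max_age)
        if fresh:
            return self._value
        try:
            self._value = self._fetch()
            self._fetched_at = now
        except Exception:
            if self._fetched_at is None:
                raise  # nothing cached yet, so there is nothing to fall back on
            # Stale but usable: keep serving rather than fail the request.
        return self._value


# Usage: the first fetch succeeds, every later fetch simulates an outage.
calls = {"n": 0}


def fetch_endpoints():
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("simulated control-plane outage")
    return ["10.0.1.5", "10.0.2.5"]


fake_now = [0.0]
config = CachedConfig(fetch_endpoints, max_age_seconds=60,
                      clock=lambda: fake_now[0])
first = config.get()
fake_now[0] = 120.0   # the cache is now stale and the refresh fails
second = config.get()
print(first == second)
```

The trade-off is serving stale data during an outage, which is usually acceptable for things like endpoint lists but not for everything.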

Hippocrates 4y ago

Yeah, my thought is not specific to this scenario. Indeed multi-AZ is a low cost and probably good idea because you often have a shared service management, control plane, and cheap bandwidth between things. Of course, when things fail they often ripple as may be the case here. I don't think clouds have their blast radius perfectly contained and they certainly don't communicate those details well.

One incident I recall was involving our GCP regional storage buckets, which we were using to achieve multi-region redundancy. One day, both regions went down simultaneously. Google told us that the data was safe but the control plane and API for the service is global. Now I always wonder when I read about MR what that actually means...

zeckalpha 4y ago

That’s true for this failure but the prior two for AWS were region wide and the one for GCP last month was global.

sdevonoes 4y ago

Perhaps it is us, the customers (and our customers, and the customers of our customers, ...), who should get used to the status of "things can go wrong"? Except for some specific scenarios (medical-related stuff, for instance), if my favourite online shopping place is down, well, it's down, I'll buy later.

metadat 4y ago

I know the Oracle OCI cloud has a reputation for never going hard-down, but also realize HN seems to loathe Big Red (understandably, to a degree, though OCI is pretty nice IME and _very_ predictable).

SixDouble5321 4y ago

I don't think it's unfair. They aren't the worst villain, but they are up there.

indigomm 4y ago

> I doubt it, since outages and their effects are nuanced.

Your point here deserves highlighting. A failure such as a zone failing is nowadays a relatively simple problem to have. But cloud services do have bugs, internal limits or partial failures that are much more complex. They often require support assistance, which is where the expertise of their staff comes into play. Having a single provider that you know well and trust is better than having multiple providers where you need to keep track of disparate issues.

mongrelion 4y ago

I agree with you. I think that having multi-AZ is the first thing to figure out before wanting to do multi-cloud, which is just another buzzword taken out of management's bullshit bucket :)

Hippocrates 4y ago

Agree, and multi AZ is usually easy. IME with AWS and GCP the control plane is the same, the scaling works across AZ, bandwidth is free and latency is near zero. The level of effort to do that is simply ticking the right boxes at setup time IME.

jtc331 4y ago

I’ve seen at least half a dozen full region AWS issues in the past 8 months.

You really need multi-region and also not be relying on any AWS service that’s located only in us-east-1 (including everything from creating new S3 buckets to IAM’s STS).
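
As a small illustration of avoiding one such dependency (a sketch, assuming standard commercial AWS regions; the helper name is hypothetical): STS also has per-region endpoints of the form `sts.<region>.amazonaws.com`, whereas the global endpoint `sts.amazonaws.com` has historically been served from us-east-1.

```python
GLOBAL_STS_ENDPOINT = "https://sts.amazonaws.com"


def regional_sts_endpoint(region: str) -> str:
    """Regional STS endpoint URL for a standard (commercial) AWS region.

    Other partitions (for example the China regions) use a different
    domain suffix and are not handled by this sketch.
    """
    if not region:
        raise ValueError("region is required")
    return f"https://sts.{region}.amazonaws.com"


print(regional_sts_endpoint("eu-west-1"))
```

AWS SDKs can typically be pointed at the regional endpoint through configuration (for example the `AWS_STS_REGIONAL_ENDPOINTS=regional` setting); check the SDK documentation for the exact mechanism.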

sfoley 4y ago

Who says this? I have literally never once seen this.

hnarn 4y ago

Is there a history of AWS downtimes available somewhere? This makes what, three times in as many months?

edit: The question isn't necessarily AWS specific, just any data on amount of downtime per cloud provider on a timeline would be nice.

colinbartlett 4y ago

I have tons of this kind of data due to my side project, StatusGator. For some services like the big cloud providers I have data going back 7 years.

There indeed has been an uptick in AWS outages recently. You can see a bit of the history here: https://statusgator.com/services/amazon-web-services

exikyut 4y ago

(I was idly curious. It appears this data is available as part of the ~US$280/mo tier, along with a bunch of other things.)

MatteoFrigo 4y ago

I don't know about AWS, but both Google Cloud and Oracle Cloud maintain at least a high level history of past outages. See https://status.cloud.google.com/summary and https://ocistatus.oraclecloud.com/history

dijit 4y ago

Given the hilariously awful reputation of the AWS status page I would hazard a guess that such a page would also be incredibly inaccurate.

If you can’t even admit you’re having an issue how can you keep an accurate record?

cassianoleal 4y ago

Similar with GCP. We had a pretty bad outage once where the status page was showing all green. Google informed us that because the actual issue was further down the stack and didn't trigger any internal SLOs the status didn't get an update. It took them hours to acknowledge and fix it.

LuciusVerus 4y ago

I'd say three times in as many weeks, give or take.

spmurrayzzz 4y ago

This is a little more broad, beyond just cloud infra providers, but includes some of the kind of data you're looking for (post-mortems for outage events): https://github.com/danluu/post-mortems

andyjih_ 4y ago

The most hilarious irony of not being able to acknowledge a 4AM page in the PagerDuty mobile app because AWS is down.

exikyut 4y ago

(Which was about AWS being down?)

JCM9 4y ago

AWS didn’t “go down”. They had an outage in one AZ, which is why there are multiple AZs in each region. If your app went down then you should be blaming your developers on this one, not AWS. Those having issues are discovering gaps in their HA designs.

Obviously it’s not good for an AZ to go down, but it does happen, which is why any production workload should be architected to have seamless failover and recovery to other AZs, typically by just dropping nodes in the down AZ.

People commenting that servers shouldn’t go down etc. don’t understand how true HA architectures work. You should expect and build for stuff to fail like this. Otherwise it’s like complaining that you lost data because a disk failed. Disks fail… build architecture where that won’t take you down.

matharmin4y ago

AWS is under-reporting the severity of the issue though. The primary outage may be in a single AZ, but there are parts of the AWS stack that affected all AZs in us-east-1, and potentially other regions as well. For example, even now I'm unable to create a new ElastiCache cluster in different AZs of us-east-1.

zymhan4y ago

> I'm unable to create a new ElastiCache cluster in different AZs of us-east-1

Isn't that because Elasticache will distribute the cluster across AZs automatically?

https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/...

boudin4y ago

Issues are across all of us-east-1, not one AZ.

Load balancers are not doing well at all. The only way to avoid an outage in this case is to be cross-region or cross-cloud, which is considerably more complex to handle and requires more resources to do well.

And I hope that nobody is listening to your advice about blaming and pointing fingers; that's the worst way to solve anything.

It's AWS's job to ensure that things are reliable, that there is redundancy and that multi-AZ infra is safe enough. The number of issues in US-EAST-1 lately is really worrying.

acdha4y ago

Some load balancers may be having issues but I have multiple busy workloads showing no issues all morning. One big challenge can be that some people reporting multi-AZ issues are shifting traffic and competing with everyone else, while workloads which were already running in the other AZs were fine. It can be really hard to accurately tell how much the problems you’re seeing generalize to everyone else.

I do agree that the end of this year has been a very bad period for AWS. I wonder whether there’s a connection to the pandemic conditions and the current job market – it feels like a lot of teams are running without much slack capacity, which could lead to both mistakes and longer recovery times.

phamilton4y ago

Echoing this. We had to manually intervene and cut off the faulty AZ because our ASGs kept spinning up instances in it and our load balancers kept sending traffic to bad hosts.

In the past I've seen both of those systems seamlessly handle an AZ failure. Today was different.

tluyben24y ago

> People commenting that servers shouldn’t go down etc. don’t understand how true HA architectures work. You should expect and build for stuff to fail like this. Otherwise it’s like complaining that you lost data because a disk failed. Disks fail… build architecture where that won’t take you down.

Is that comparison fair? If you have 2 raid-5 mirrored raid 5 boxes in your room and all disks fail at the same time, you should complain. And that won't happen. These entire-datacenter failures should be anticipated, but to expect them is a bit too easy, I think. There are plenty of hosters who haven't had this happen even once in the last decade in their only datacenter. I do not find it strange to expect or even demand that level, while still protecting yourself in case it happens anyway, if that fits your specific project and budget.

Edit: OK, I meant that raid-5 remark in the same context as the hosting; it can and does happen but it shouldn't; you should plan for a contingency, but to expect it goes too far. We never had it (1000s of hard drives, decades of hosting, millions of sites) and so we plan for it with backups; if it happens it will take some downtime, but it costs next to nothing over time to do that. If we expected it, we would need to take far different measures. And we had less downtime in a decade than AWS AZs have had in the past months. I have a problem with the word 'expect'.

phone86753094y ago

> Is that comparison fair? If you have 2 raid-5 mirrored raid 5 boxes in your room and all disks fail at the same time, you should complain. And that won't happen.

There are plenty of situations where this might happen if they’re in your room: a lightning strike can cause a surge that causes the disks to fry, a thief might break in and steal your system, your house might burn down, an earthquake could cause your disks to crash, a flood could destroy the machines, and a sinkhole could open up and swallow your house. You may laugh at some of these as being improbable, but I have seen _all_ of these take out systems between my times in Florida (lightning, thief, sinkhole, and flood) and California (earthquake and house fire).

The fix for this is the same fix being proposed by the parent post - putting physical space between the two systems so if one place becomes unavailable you still have a backup.

acdha4y ago

> If you have 2 raid-5 mirrored raid 5 boxes in your room and all disks fail at the same time, you should complain. And that won't happen.

Here are some examples where that happened:

1. Drive manufacturer had a hardware issue affecting a certain production batch, causing failures pretty reliably after a certain number of power-on hours. A friend learned the hard way that his request to have mixed drives in his RAID array wasn’t followed.

2. AC issues showed a problem with airflow, causing one row to get enough warmer that faults were happening faster than the RAID rebuild time.

3. UPS took out a couple racks by cycling power off and on repeatedly until the hardware failed.

No, these aren’t common but they were very hard to recover from because even if some of the drives were usable you couldn’t trust them. One interesting dynamic of the public clouds is that you tend to have better bounds on the maximum outage duration, which is an interesting trade-off compared to several incidents I’ve seen where the downtime stretched into weeks due to replacement delays or manual rebuild processes.

8note4y ago

More generally, any correlation between two items gives potential for a correlated failure.

Same manufacturer, same disk space, same location, same operator, same maintenance schedule, same legal jurisdiction, same planet, you name it, and there's a common failure to match
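
To put rough numbers on that, here is a toy model in Python. It assumes each of two replicas fails independently with probability `p_independent`, plus a shared cause (same batch, same room, same operator) that takes out both with probability `p_shared`. The probabilities are illustrative, not measured.

```python
def p_both_fail(p_independent: float, p_shared: float) -> float:
    """Probability that both replicas are down in the toy model.

    Either the shared cause fires (probability p_shared), or it does not
    and both replicas fail independently (p_independent squared).
    """
    return p_shared + (1.0 - p_shared) * p_independent ** 2


# Assumed, illustrative numbers: 1% independent failure chance per replica.
no_shared_cause = p_both_fail(0.01, 0.0)
with_shared_cause = p_both_fail(0.01, 0.005)
print(no_shared_cause, with_shared_cause)
```

With these assumed inputs, even a 0.5% shared cause dominates: the chance of losing both replicas goes from about 1 in 10,000 to about 1 in 200.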

AtNightWeCode4y ago

"Won't happen". The 40,000 hours of runtime bug did happen. I would recommend people to take backups and store them offline or at least isolated from the main storage.

dylan6044y ago

>And that won't happen

HA! I had received new 16-bay chassis and all of the drives needed plus cold spares for each chassis. Set them up and started the RAID-5 init on a Friday. Left them running in the rack over the weekend. Returned on Monday to find multiple drives in each chassis had failed. Even with one of the 16 drives dedicated as a hot swap, the volumes would all have failed in an unrecoverable manner.

All drives were purchased at the same time, and happened to all come from a single batch from the manufacturer. The manufacturer confirmed this via serial numbers, and admitted they had an issue during production. All drives were replaced and at a larger volume size.

TL;DR: Drives will fail, and manufacturing issues happen. Don't buy all of your drives in an array from the same batch! It will happen. To say it won't is just pure inexperience.
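
A minimal sketch of checking an array for that problem, in Python. The batch labels are hypothetical (in practice they would come from serial numbers or purchase records), and the rule used here is a deliberately strict assumption: flag any batch that holds more drives than the array can lose, which for RAID-5 is one.

```python
from collections import Counter


def risky_batches(drive_batches, tolerated_failures=1):
    """Return {batch: drive_count} for batches that alone could sink the array.

    A batch is flagged when it contains more drives than the array can
    lose (tolerated_failures), so one bad batch could exceed the parity.
    """
    counts = Counter(drive_batches)
    return {batch: n for batch, n in counts.items() if n > tolerated_failures}


# Hypothetical arrays: one bought in a single batch, one deliberately mixed.
single_batch_array = ["B1"] * 16
mixed_array = ["B1", "B2", "B3", "B4"]
print(risky_batches(single_batch_array))
print(risky_batches(mixed_array))
```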

tyingq4y ago

>AWS didn’t “go down”

The context of the parent seems to be that they intermittently couldn't get to the console. That seems fair to me. If we're blaming developers and finding gaps in HA design, then AWS should also figure out how to make the console url resilient. If it's not, then AWS does appear to be down.

I imagine it's pretty hard to design around these failures, because it's not always clear what to do. You would think, for example, that load balancers would work properly during this outage. They aren't. Or that you could deploy an Elasticache cluster to the remaining AZs. You can't. And I imagine the problems vary based on the AWS outage type.

Similarly, with the earlier broad us-east-1 outage, you couldn't update Route53 records. I don't think that was known beforehand by everyone that uses AWS. You can imagine changing DNS records might be useful during an outage.

strunz4y ago

Except many AWS services still route through us-east-1 anyway, which is why they have had huge outages recently. AWS isn't as redundant as people think it is.

bencoder4y ago

Our API is just appsync (graphql) + lambdas + dynamoDB so, theoretically, we shouldn't have been affected. But about 1 in 3 requests was just hanging and timing out.

As others have said, they are not being forthright about the severity of the issue, as is standard.

dkryptr4y ago

100% agree. I'm actually surprised AWS hasn't built a Chaos Monkey into their APIs/console so people can regularly test their resiliency to an AZ going down.

edit: of course, AWS does have this: AWS Fault Injection Simulator

biohax20154y ago

AWS Fault Injection Simulator does this.

stingraycharles4y ago

Because then people would complain about AWS being less reliable than Azure / GCP.

TameAntelope4y ago

Here's a secret that's now saved me from three outages this month:

Be in multiple AZs, and even multiple regions but if you're going to be in only one AZ or one region, make it us-east-2.

IceWreck4y ago

Honestly my server at home has more uptime than US-East-1

TacticalCoder4y ago

I should blog about this one day but...

I have a server at OVH (not affiliated to them) which, at this point, I keep only for fun. It has 3162 days of uptime as I type this.

3 162 days. That's 8 years+ of uptime.

Does it have the traffic of Amazon? No.

Is it secure? Very likely not: it's running an old Debian version (Debian 7, which came out in, well, 2013).

It only has one port opened though, SSH. And with quite a hardened SSH setup at that.

I installed all the security patches I could install without rebooting it (so, yes, I know, this means I didn't install all the security patches for some required rebooting).

This server is, by now, a statement. It's about how stable Linux can be. It's about how amazingly stable Debian is. It's also about OVH: at times they had part of their datacenter burn (yup), at times they had full racks that had to be moved/disconnected. But somehow my server never got affected. It may have happened that at one point OVH had connectivity issues but my server never went down.

I "gave back" many of my servers I didn't need anymore. But this one I keep just because...

I still use it, but only as an additional online/off-site backup where I send encrypted backups. It's not as if it gets zero use: I typically push backups to it daily.

They're only backups, they're encrypted. Even if my server is "owned" by some bad guys, the damage they could do is limited. Never seen anything suspicious on it though.

I like to do "silly" stuff like that. Like that one time I solved LCS35 by computing for about four years on commodity hardware at home.

I think it's about time I start to do some archeology on that server, to see what I can find. Apparently I installed Debian 7 on it in mid-October 2013.

I've created a temporary user account on it, for which at times I've handed the password (before resetting it) to people just so they could SSH in and type: "uptime".

It is a thing of beauty.

Eight. Years. Of. Uptime.

nextaccountic4y ago

> Like that one time I solved LCS35 by computing for about four years on commodity hardware at home.

Awesome! Are you Bernard Fabrot [0]?

[0] https://www.csail.mit.edu/news/programmers-solve-mits-20-yea...

TacticalCoder4y ago

Yup that's me... I fear this (old by now) story blew my "tacticalcoder" cover.

kasey_junk4y ago

I read this as a cautionary tale. Here we have a server that only through the grace of god is still up, and is likely owned. If it isn't, it's because of how little is going on with it.

At its current use, it's likely not a major issue but imagine if someone saw this uptime and thought to take it as a statement of reliability and built a service on it. I for one, would want that disclosed because this is a disaster waiting to happen. I'd much rather someone disclose that they had a few servers each with no longer than 7 days of uptime because they'd been fully imaged and cycled in that time...

TacticalCoder4y ago

It works both ways: it is also a cautionary tale for those who are prone to believe it's all unreliable cattle that needs constant restart because nothing is stable nor reliable...

plandis4y ago

Your server could just be an outlier. Doesn’t really say anything about AWS or any cloud provider.

BossingAround4y ago

Does your server at home handle similar traffic to that of US-East-1 since you're comparing uptime?

Similarly, my laptop, if I keep it plugged in the wall, and enable httpd on localhost, will surely have better uptime than any of the top clouds. I'd bet that it'd have 100% uptime if I plugged in a UPS and cared for traffic on my local network only.

christophilus4y ago

Most people don't need to handle the traffic of US-East-1. They just need a single, simple, mostly reliable server. But they're often told, "Don't do that. It's too hard, and irresponsible, and what if you get a spike in traffic, and what if you need to add 5 new servers, and security is really hard."

In reality, most people don't need to scale. An occasional spike in traffic is a nuisance, but not the end of the world, and security is not terribly hard, if you keep your servers patched (which is trivial to automate).

I really don't understand why there's so much FUD around running your own stuff.

ryanbrunner4y ago

I think most people on here are coming from the perspective of startups, which scale out of a single server setup pretty quickly. At a bare minimum, most will have dedicated purpose-built servers like Redis or a DB, and often there's separate background workers, or a load balancer with a couple of web servers.

When your server requirements get into needing 5-6 servers (not at all atypical for a startup in their first year of being launched), running your own stuff becomes more of a challenge pretty quickly. Factor in 2-3x growth a year, and the challenges just mount.

Sammi4y ago

> Does your server at home handle similar traffic to that of US-East-1 since you're comparing uptime?

Of course it doesn't. Why are you asking antagonistic questions?

grumple4y ago

He asked it to demonstrate the point that uptime is trivial for one server with no traffic, and much harder at scale with auto scaling.

akyoan4y ago

> Honestly my server at home has more uptime than US-East-1

Is this not antagonistic? It's pointless to make these statements, so your parent comment pointed it out. Go downvote the first one instead.

loopdoend4y ago

Your home ISP has 100% uptime? That's incredible.

omh24y ago

Let's be real here, we don't need anywhere near 100% ISP uptime to beat AWS over the last couple months...
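
For anyone who wants to run that comparison on their own numbers, a minimal Python sketch of the arithmetic. The 8 hours and 60 days below are made-up inputs, not actual AWS or ISP figures.

```python
def availability_percent(downtime_hours: float, period_days: float) -> float:
    """Availability over a period, as a percentage of total hours."""
    total_hours = period_days * 24
    if total_hours <= 0:
        raise ValueError("period_days must be positive")
    return 100.0 * (1.0 - downtime_hours / total_hours)


# Hypothetical: 8 hours of downtime spread over a 60-day window.
example = round(availability_percent(8, 60), 3)
print(example)
```

Under those assumed inputs the result is 99.444%, so a home connection would only need to stay below 8 hours of outage in two months to match it.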

adamm2554y ago

Mine's had 100% uptime for the past 2 months. I’ve had better value for money using a NUC for personal projects than public cloud subscriptions over the past few years.

BossingAround4y ago

I mentioned local network, didn't I...

IceWreck4y ago

No but I access my home-server remotely from my university all the time and it hasn't gone down once.

Better uptime than paying for EC2 on AWS US-East-1.

Obviously this approach isn't scalable but it serves me well.

amelius4y ago

> Obviously this approach isn't scalable but it serves me well.

It's perfectly scalable. Just give everybody their own home server.

RONROC4y ago

The prevailing wisdom throughout the last couple of years was:

“ditch your on-prem infrastructure and migrate to a major cloud provider”

And it's starting to seem like it could be something like:

“ditch your on-prem infrastructure and spin up your own managed cloud”

This is probably untenable for larger orgs where convenience gets the blank check treatment, but for smaller operations that can’t realize that value at scale and are spooked by these outages, what are the alternatives?

TameAntelope4y ago

I don't think it's reasonable to be spooked by these outages, and to think your resolution would be to leave AWS entirely.

A much faster and more effective solution that doesn't have you trading cloud problems with on-prem problems (the power outage still happens, except now it's your team that has to handle it) would be to update your services to run in multiple AZs and multiple regions.

Get out of AWS if you want, but don't get out of AWS because of outages. You should be able to mitigate this relatively easily.

f6v4y ago

Self-managed infrastructure doesn’t fail now?

dijit4y ago

What an absolutely pointless comment.

Everything fails, we can argue the rate. But I would argue that understanding your constraints is better.

If you know that your secret storage system can't survive if a machine goes away: well, you wire redundant paths to the hardware and do memory mirroring and RAID the hell out of the disks. And if it fails you have a standby in place.

But if you use AWS Cognito.

And it goes down.

You're fucked mate.

f6v4y ago

It’s pointless to discuss how crappy cloud is whenever AWS goes down. Most of the businesses relying on the automatic RDS backups or EC2 auto scaling just don’t have time to think about all the underlying tech. I mean, I don’t manually allocate memory for variables anymore either. Do I get screwed when there’s a memory leak? Yes. What do I do about it? Move on.

plandis4y ago

If you think you can do better than AWS, GCP, Azure there is a lot of money to be made, for sure.

iso16314y ago

Not at this rate.

I remember we had a power outage in 2006, it actually took one of my services off air. Since then of course that has been rectified, and the loss of a building wouldn't impact on any of the critical, essential or important services I provide.

mbesto4y ago

> Not at this rate.

Source? Has there ever been an industry wide survey that compares availability from "insert average colo/data center operations" with the cloud ones?

And I'm not talking about "we have 12 SREs who are based in Cupertino and are all paid top dollar to support a colo"...I'm talking average.

ctvo4y ago

> Not at this rate.

And what rate is this? It gets attention because it impacts more people, but AWS / GCP / Azure uptime is still better than what I've seen for small / mid size businesses trying to manage their own infrastructure.

RONROC4y ago

We’re going to be having this same tired, pedantic, round-about conversation when Teslas routinely decide to take out a family of four because they mistook a plastic bag for an off-ramp.

Commenters will show up like clockwork and say shit like:

“What man, it’s not like cars didn’t crash before? Haha”

Don’t be dense dude. And definitely don’t pursue a leadership position anytime in the future.

deanCommie4y ago

Tesla fans are annoying, but it is absolutely valid that the safety bar for self-driving cars can't be "100% perfectly safe" - it needs to be "safer than the alternative".

The problem with both this example, and the AWS one (it needs to have better availability than your personal home-spun solution, and it does), is that people are amazing at deluding themselves.

"Yes, cars are dangerous, because other people can't drive. But I'm a better than average driver"

"Yes, other people will build unreliable systems. But I know how to architect for my use case and ensure that for my needs the availability will be higher than AWS's"

Both are true* in the micro sense and false in the macro sense.

* Not really. 88% of Americans think they are "above average" drivers.

f6v4y ago

Well, anyone non-dense will tell you that the most dense thing you can do is say “oh my god run for your lives” whenever there’s an outage. No statistics, no cost-benefit analysis. Just commenting “Haha you can’t manage your own raids and ciscos what a noob” makes you a thought leader, yes.

xyst4y ago

“Hybrid and multi cloud” is the future. In other words, give us more fucking money.

paulryanrogers4y ago

Spread the risk? Smaller on prem and cloud / rented bare metal?

Spivak4y ago

Nah, it's actually better to concentrate the risk in this case.

If your app depends on a few 3rd party services -- SendGrid, Twilio, Okta and they're all hosted on different infra then congrats! You're gonna have issues when any one of them is down, yayyy.

Also the marketing benefit can't be downplayed. If your postmortem is "AWS was having issues" then your execs and customers just accept that as the cost of doing business because there's a built-in assumption that AWS, Azure, GCP are world class and any in-house team couldn't do it better.

ernsheong4y ago

Google Cloud seems to be doing much better, at least recently. There's also Azure. AWS seems to have placed growth above everything else at customers' expense.

Victerius4y ago

I'm tempted to found a startup to help businesses migrate from cloud providers to on-prem infrastructure.

datavirtue4y ago

Slinging some of that sweet Tanzu or Rancher?

potas4y ago

Slack seems to have some issues because of that - I'm not sure if anyone is receiving messages, as it became completely silent for the last 15 minutes or so.

jenoer4y ago

Sending and receiving messages works here, but editing them does not, it throws an error. Statuses such as "calling" also do not seem to be updated any longer.

Edit: Restarting Slack does update the edited messages.

Edit 15:24 CET: Slack is back up.

jakub_g4y ago

Same: only normal text seems kinda working

- edits failing or working with big lag;

- "Threads" view slow;

- can't emoji-react;

- can't upload images;

- people also say they can't join new channels.

darkwater4y ago

I fail to understand how a big player like Slack can be impacted this way by a failure in a single AZ in a specific AWS region. But at least the main feature (sending and displaying messages) is still working.

jakub_g4y ago

https://status.slack.com/2021-12/a17eae991fdc437d

> We are experiencing issues with file uploads, message editing, and other services. We're currently investigating the issue and will provide a status update once we have more information.

> Dec 22, 1:58 PM GMT+1

Pandabob4y ago

Uploading images doesn't work for me.

oneeyedpigeon4y ago

New messages seem to be ok for me, but editing old ones and uploading images both seem to be broken right now.

aden1ne4y ago

I can't edit messages, nor create channels. Messages are only received with a several minute delay.

izietto4y ago

I guess that's why I'm experiencing weird issues with Heroku:

    remote: Compressing source files... done.
    remote: Building source:
    remote: 
    remote: ! Heroku Git error, please try again shortly.
    remote: ! See http://status.heroku.com for current Heroku platform status.
    remote: ! If the problem persists, please open a ticket
    remote: ! on https://help.heroku.com/tickets/new

dijit4y ago

Yes.

Another thread: https://news.ycombinator.com/item?id=29648325

vegai_4y ago

5ish years ago it was common knowledge that us-east-1 is generally the worst place to put anything that needs to be reliable. I guess this is still true?

taf24y ago

I don't know about that. It was more like common knowledge that one availability zone in us-east-1 was a problem - you would have to figure out which one it was usually by spinning up instances in all 4 zones (now 6)... and that it was the largest of all regions making it ideal place to put your service if you wanted to be close to other vendors/partners in AWS...

beermonster4y ago

us-east-1 seems to be AWS’s not so well kept little dark secret!

In all seriousness though - even non-regional AWS services seem to have ties to us-east-1 as evidenced by the recent outages. So you might be impacted even if it looks like (on paper at least) you’re not using any services tied to that region.

thow-58d4e8b4y ago

Unfortunately, the fact that us-east-1 is roughly 10% cheaper than other regions usually overrides any other concerns

dolibasija4y ago

One of our EC2 instances in us-east-1c is unavailable and stuck in "stopping" state after a force stop. Interestingly enough, EC2 instances in us-east-1b don't seem to be affected.

The console is throwing errors from time to time. As usual no information on AWS status page.

JshWright4y ago

Instances stuck in the "stopping" state is pretty common, in my experience.

crescentfresh4y ago

The affected zone is use1-az4. Whatever that maps to (1a, 1b, 1c) is different per customer.

benedikt4y ago

you can find out which zone is mapped to use1-az4 for your account with awscli:

    aws ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | select(.ZoneId == "use1-az4") | .ZoneName'
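
If jq isn't installed, the same lookup can be sketched in Python against the JSON that `aws ec2 describe-availability-zones` prints. The `AvailabilityZones`, `ZoneId` and `ZoneName` fields are the ones the jq filter above reads; the sample payload and its letter assignments are made up, since the real mapping differs per account.

```python
import json


def zone_names_for_id(describe_output, zone_id):
    """Return the ZoneName(s) in this account that map to the given ZoneId.

    describe_output is the JSON text printed by
    `aws ec2 describe-availability-zones`.
    """
    data = json.loads(describe_output)
    return [
        az["ZoneName"]
        for az in data["AvailabilityZones"]
        if az["ZoneId"] == zone_id
    ]


# Hypothetical sample output; a real account will have its own mapping.
SAMPLE_OUTPUT = json.dumps({
    "AvailabilityZones": [
        {"ZoneName": "us-east-1a", "ZoneId": "use1-az2"},
        {"ZoneName": "us-east-1b", "ZoneId": "use1-az6"},
        {"ZoneName": "us-east-1c", "ZoneId": "use1-az4"},
    ]
})
affected = zone_names_for_id(SAMPLE_OUTPUT, "use1-az4")
print(affected)
```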

mnordhoff4y ago

Or if you open the EC2 console (it's up this time!) and scroll down to the bottom.

https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#...:

(Edit: I hope I didn't sound sarcastic. I don't open random console pages and scroll all the way down to check for new features. Some people will have noticed, some won't.)

chrishynes4y ago

I had the same issue with unavailable, but on an instance in us-east-1b. Finally just got the force stop to go through a minute ago and it's now running and available again.

mike-cardwell4y ago

Your us-east-1b may be the parent's us-east-1c.

The letters are randomised per AWS account so that instances are spread evenly and biases to certain letters don't lead to biases to certain zones.
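
How AWS actually assigns the letters isn't described in this thread, so the following Python sketch is purely an illustration of the idea: a stable, per-account shuffle of zone IDs onto letters. The account IDs are made up, and seeding a PRNG with the account ID is an assumption chosen for the example, not AWS's mechanism.

```python
import random

ZONE_IDS = ["use1-az1", "use1-az2", "use1-az3", "use1-az4", "use1-az5", "use1-az6"]
LETTERS = "abcdef"


def zone_letter_map(account_id, region="us-east-1"):
    """Return a stable, account-specific mapping of zone name -> zone ID.

    Illustration only: seeds a PRNG with the account ID so the same
    account always sees the same letters, while different accounts
    generally see different ones.
    """
    rng = random.Random(account_id)  # string seeds are deterministic
    shuffled = ZONE_IDS[:]
    rng.shuffle(shuffled)
    return {region + letter: zone_id for letter, zone_id in zip(LETTERS, shuffled)}


# Hypothetical account IDs.
mapping_a = zone_letter_map("111111111111")
mapping_b = zone_letter_map("222222222222")
print(mapping_a["us-east-1c"], mapping_b["us-east-1c"])
```

The practical consequence is the one already noted here: compare zone IDs such as use1-az4, not letters, when comparing notes across accounts.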

chrishynes4y ago

Huh, that's interesting. Didn't know that, but makes sense.

throwaway9843934y ago

I'm not sure if we should say "AWS is down" if only us-east-1 is down. That region is more unstable than Marjorie Taylor Greene on a one-legged stool.

fivea4y ago

> I'm not sure if we should say "AWS is down" if only us-east-1 is down.

The thing is, us-east-1 represents the whole AWS for the majority of us.

CubsFan10604y ago

And only one AZ in us-east-1. But... it's clearly having a large impact as well.

300bps4y ago

The 1c part is meaningless. Those letters are randomized per customer to prevent letter biases from leading to more people in 1a for instance.

crescentfresh4y ago

Was stuck on stopping in us-east-1b. Cannot start now.

ClumsyPilot4y ago

Now that everyone and their dog is on AWS, it is not just 'a website stops working'; half the world, from telephones to security doors and IoT equipment, stops working?

I am not sure if the move to the cloud has reduced the number of failures, but it definitely has made these failures more catastrophic.

Our profession is busy making the world less reliable and more fragile, we will have our reckoning just like the shipping industry did.

dehrmann4y ago

It's more like it's making downtimes correlated rather than random. For everything other than urgent communication, I'm not sure if this is a big deal.

madeofpalk4y ago

all I've noticed is slack was a bit unreliable for a little bit, but i just carried on and otherwise ignored it. my world did not stop working.

ClumsyPilot4y ago

My apartment block has a dialing system that, instead of using a cable that goes to your apartment, relies on IP telephony and calls your mobile phone. It stops working if there is no internet, or your phone is out of battery, or you are not home but your wife is.

KronisLV4y ago

Same, maybe that was a related issue.

Today, on Slack I could not edit messages, could not edit statuses and could not post attachments. Pretty annoying!

schnebbau4y ago

So, how many execs are going to push to move to self-managed hosting in the new year?

Packaging a way to migrate off AWS could be a unicorn idea.

qwertyuiop_4y ago

None. Amazon hired all the ex-VPs, CTOs, and Directors of small, medium, and large companies with Rolodexes.

mikece4y ago

Would need one hell of a compression algorithm to keep the data exfiltration costs down.

pm904y ago

Pied Piper

adamm2554y ago

Anyone using VMware Cloud services is probably laughing. Just chuck it at Azure or GCP or back on prem.

dehrmann4y ago

Depends on how many customers are ready to move to a different vendor. I suspect most customers are forgiving because either they were also down or half the services they use were down. You don't get fired for hosting in AWS.

wallacoloo4y ago

AWS has its Outposts product for on-prem hosting. Not 100% self-managed, but maybe enough to satisfy the execs and make your market a bit smaller.

Nextgrid4y ago

Does it come with its own locally-hosted console or does it still rely on the main AWS control plane? If the latter then it could be affected too.

rsp19844y ago

Bitbucket having issues too: https://bitbucket.status.atlassian.com/

captn3m04y ago

4:35 AM PST We are investigating increased EC2 launched failures and networking connectivity issues for some instances in a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. Other Availability Zones within the US-EAST-1 Region are not affected by this issue.

via https://stop.lying.cloud/

junon4y ago

Can anyone explain the affiliation of stop.lying.cloud to Amazon? All of the legalese in the header/footer seem to indicate it's actually owned and run by Amazon. If so... why? Why not just... use the real status page?

I mean I'm glad it exists, don't get me wrong. Just weird that they'd have two status pages, one seemingly existing only to sort of 'mock' themselves...

taspeotis4y ago

The people who maintain the unofficial site would have, at some point, used their CTRL and C keys followed not immediately, but closely by, their CTRL and V keys.

junon4y ago

But that is copyright infringement. You're not allowed to copy some work, modify it, then slap the original copyright on it. This is an illegal website, prone to being taken down by AWS.

It's just strange.

jrumbut4y ago

I was curious too. An HN user takes credit for it here: https://news.ycombinator.com/item?id=24499159

Apparently it does some simple transformations of the actual status page, which is why the Amazon copyright stuff is in there.

deadbunny4y ago

FWIW `lying.cloud` is registered with Namecheap. `amazon.com`/`aws.com`/`amazon.ca` are all registered with MarkMonitor. And I know that AWS uses Gandi behind the scenes for domain reg. Given that, I'd hazard a guess that it's not owned by Amazon. Definitely not a guarantee though.

IceWreck4y ago

Amazon's own status page sort of lies. So someone probably wget-ed the status page, kept the same html and css and hooked it to their own API to display correct info.

anpat4y ago

Corey Quinn (https://twitter.com/quinnypig) runs it.

He also has a decent newsletter and witty commentary, for all things AWS.

andyjih_4y ago

It's not official. The people making the page probably just copied everything, including the legalese.

bithavoc4y ago

I think it was built[0] by @quinnypig

[0] https://twitter.com/quinnypig/status/1468331194471178241?s=2...

snth4y ago

What is this website? Is there an "about" or something? What is it doing differently from the official AWS status page?

mule14y ago

Feel for devops peeps who are just trying to chill for Christmas

stunt4y ago

It seems that it's due to a power loss.

[05:01 AM PST] We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.

pawelduda4y ago

Bitbucket is affected, pages randomly take forever to load or return 500

Pandabob4y ago

Yep, just botched a merge likely because of this.

el_duderino4y ago

Bitbucket just completed their migration to AWS too. Rough start.

darkwater4y ago

Fields of green here https://status.aws.amazon.com/ Anyway I can access the web console with no issue (eu-west)

hnarn4y ago

I think it's pretty widely accepted that AWS' own status pages are utterly useless.

s_dev4y ago

You would think that but there are always a few contrarian AWS evangelists in the comments going on about the "difficulty" in operating a status page as though it were trying to conjure a P=NP proof.

Like how come down detector can do a superb job of detecting when AWS goes down and AWS can't? Because AWS doesn't want account managers of SLAs asking for credits for the uptime they're paying for but not getting.

https://downdetector.co.uk/status/aws-amazon-web-services/

darkwater4y ago

Yeah, it was just to confirm that this time was no different :)

hdjjhhvvhga4y ago

In Russia they have a specific name for it:

https://en.wikipedia.org/wiki/Potemkin_village

lordnacho4y ago

The elite DevOps teams are always assigned to the status page

temp08264y ago

Changes to this page require very high level management approvals (source: used to work at aws)

JCM94y ago

Status page says there are issues. It’s not all green.

oneeyedpigeon4y ago

Now. It took a lot longer for that page to know/admit the problems than it did half the internet.

anshumankmr4y ago

If AWS, GCP and Azure go down, we will be back in the stone ages, right?

dijit4y ago

The only stuff that will work will probably depend on things in AWS in some form.

That, or people never took the “if AWS goes down then lots of people will have a problem, so we’ll be fine” line seriously; there are few such cases.

omosubi4y ago

I do wonder if the great resignation has anything to do with this. My team (no affiliation with Amazon) was cut in half from last year and we are struggling to keep up with all the work

sctgrhm4y ago

Invision image uploads are down too because of this : https://status.invisionapp.com/

camdenreslink4y ago

Who needs chaos monkey? Just host on AWS for a similar effect.

gtsop4y ago

Question to the sysadmins here: Is it really that outrageous of amazon to have such issues or are people way too spoiled to appreciate the effort that goes into maintaining such a service?

Edit: Not supporting amazon, i generally dislike the company. I just don't understand the extent to which the criticism is justified

dsr_4y ago

The issue is in three parts:

1. Did AMZN build an appropriate architecture?

2. Did AMZN properly represent that architecture in both documentation and sales efforts?

3. What the heck is going on with AMZN?

Let's say that they build an environment in which power is not fully redundant and tested at the rack level, but is fully redundant and tested across multiple availability zones. Did they then issue statements of reliability to their prospective and existing customers saying that a single availability zone does not have redundant power, and customers must duplicate functionality in at least 2 AZs to survive a SPOF?

rswail4y ago

So why are people not migrating out of us-east-1? Operating in ap-southeast, we weren't that affected by the us-east-1 down time, although our system is reasonably static and doesn't make lots of IAM calls (which seems to be a large SPOF from us-east-1).

dijit4y ago

Some “global” systems run in us-east-1; even if you’re not hosted there, a service you depend on might be.

Notably: cognito, r53 and the default web UI. (You can work around the webui one I’m told, by passing a different domain instead of just console.aws.amazon.com)

watermelon04y ago

Don't forget about CloudFront, which can only be configured via us-east-1.

taf24y ago

latency. us-east-1 is positioned very nicely relative to many large businesses in North America and Europe. This gives you pretty good access to a very large percentage of the economies of the world with good latency... while not requiring you to architect your application around multiple regions...

reactive554y ago

Bitbucket is down as well because of this. https://bitbucket.status.atlassian.com/incidents/r8kyb5w606g...

sprite4y ago

My Elastic Beanstalk instances are completely unreachable. Seems at the very least ELB is down. Looking @ down detector it looks like this is taking a bunch of sites down with it. As usual AWS status page shows all green.

exabrial4y ago

As an industry, can we please stop making products like vacuums that can't operate unless someone else's computer is working in a field in Virginia? There's literally no reason for it.

antihero4y ago

I wonder how many 9s AWS is going for. Can't be a lot of 9s anymore.

arh684y ago

89.9999 % has a lot of 9s, dare I say military-grade.

yabones4y ago

Nine Fives is the new Five Nines!
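
For scale, here is a minimal sketch of what those percentages mean in downtime per year. It assumes a 365-day year and treats availability as a plain fraction of wall-clock time; the function name is made up for illustration.

```python
# Sketch: convert an availability percentage into expected downtime per year.
HOURS_PER_YEAR = 365 * 24  # assumption: ignores leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service at this availability is expected to be down."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for label, pct in [
    ("five nines", 99.999),
    ("89.9999 %", 89.9999),
    ("nine fives", 55.5555555),
]:
    print(f"{label}: {downtime_hours_per_year(pct):.2f} hours/year")
```

Under those assumptions, five nines allows roughly five minutes of downtime a year, while 89.9999 % allows about 876 hours, or around 36.5 days.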

loudtieblahblah4y ago

Yay! Adult snowday!

RobertKerans4y ago

Apropos of nothing, but a few Christmases ago the place I worked had a dedicated fibre line that some workmen doing gas line repairs sawed straight through, which took out everything; I was just a drone worker at the time & it was a beautiful thing

exogenousdata4y ago

Looks like the SEC's Edgar website is affected. This is the site the SEC uses to post the filings of public companies. Normally there are a hundred or more company filings in the morning starting at 6am ET. This morning there are two.

https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent

debarshri4y ago

Hubspot seems to be down too [0].

[0] https://status.hubspot.com/

amai4y ago

Thank goodness we host all IT services in the same cloud. Imagine the chaos we'd have if everything didn't fail at the same time.

iso16314y ago

Ahh, the cloud

https://imgflip.com/i/5yrt24

lukeqsee4y ago

I can't get to the console either, receiving a "Temporarily unavailable" notice without branding.

sascha_sl4y ago

quay.io is also dead, as well as giphy, some parts of slack

just the weekly internet apocalypse, happy holidays fellow SREs

richardfey4y ago

As far as I understood a whole availability zone went down; today is also the day a lot of people understand why "multi-AZ" matters, so I don't think it's fair to say that services are down because the whole AWS is down.

jakub_g4y ago

Where are you located? "X is down" without location is only moderately useful.

I'm having issues with Slack from central EU (Poland): can't upload images or send emoji reactions to posts (curiously, text works fine). Wondering if it's linked.

riknoxOP4y ago

AWS Console runs in us-east-1 so that points to at least that region having issues IIRC. I am also having Slack issues in EU.

hdjjhhvvhga4y ago

You should complain to Slack then. It's their problem to choose a reliable provider, and AWS seems to have trouble with keeping this status.

kemals4y ago

Here is The Internet Report episode on the topic of recent AWS outages that covers outage and root causes: https://youtu.be/N68pQy8r1DI

bob10294y ago

2 of our servers are fucked right now. VOIP services down.

Only with AWS and Github do I seem to get panicked text messages on my phone first thing in the morning... Our workloads on Azure typically only have faults when everyone is in bed.

fipar4y ago

https://downdetector.com/status/aws-amazon-web-services/

devoutsalsa4y ago

We'll never really know the answer, but I have to wonder what percentage of comments on this thread are from Amazon downplaying the severity & other cloud providers hyping it up.

mongrelion4y ago

You give HN too much credit.

j10c4y ago

I also had problems loading YouTube at the same time (for 10-15 minutes). It looks like a coincidence, but who knows if Google uses some of the infrastructure from AWS.

pkulak4y ago

I used to think it was silly to have your own hardware (like a NAS) in your house. What makes you think you can do it better than AWS?

Santa is bringing me a Synology in three days.

darkstar9994y ago

Why not both? I just got a Synology NAS and it makes cloud sync dead simple. Now the most important things are on my PC, mirrored on 2 drives in my NAS, and on AWS S3 (or any other cloud storage).

pkulak4y ago

Oh yeah. My plan is to migrate everything to the NAS, then have that back up to Glacier and/or Rsync.net. By S3, do you mean Glacier?

RobertKerans4y ago

Assuming crates.io is AWS-backed? Getting a fun situation where direct dependencies of an application are downloading but then the sub-dependencies aren't.

lukeqsee4y ago

crates.io is directly hosted on GitHub, but I'm sure some dependencies use S3 or other AWS services for things.

pietroalbini4y ago

The crates.io index is hosted on GitHub, but the application/API is hosted on Heroku (so in the us-east-1 AWS region) and the downloads on S3/CloudFront. And yes crates.io is currently impacted.

RobertKerans4y ago

Yep, S3 possibly the villain here

withinboredom4y ago

I wonder if there's an S3-compatible service with similar pricing that can be used as a fallback? Are DigitalOcean's S3-compatible storage accounts backed by real S3?

RobertKerans4y ago

Ah, back to normal now. Getting intermittent flickers on some of our apps but all seems solid-ish again

mwcampbell4y ago

Yeah, and I can't publish a crate.

kingsloi4y ago

Of all the AWS outages, my team and I have dodged them all, except this one. 3 instances down and unavailable

> Due to this degradation your instance could already be unreachable

>:(

electroly4y ago

FWIW I don't think that message has anything to do with this outage. I think it's just a coincidence that you got some degraded hosts. They didn't send out emails like that for this AZ outage (nor would I expect them to -- that email is for when host machines die).

bobviolier4y ago

Seems illogical that this is just a single AZ in a single US region. We are having issues pulling images from public.ecr.aws from an EU region.

saxonww4y ago

I don't know what's still true, but at one point us-east-1 seemed more critical than other regions because there were some things that had to be there. One thing that comes to mind is ACM certificates used with things like API Gateway (probably Cloudfront), they had to be in us-east-1 no matter where the rest of your infrastructure was.

So it's not shocking to me that something going down in us-east-1 could have impact on other regions.

l0b04y ago

Meta: I posted a "PyPI is down" link a few days ago, and the post got insta-flagged. Is there some rule about this sort of thing?

sswaner4y ago

Not down as of 7:40 EST. US-EAST-1 hosted site (athene.com). Cognito, API Gateway, Lambda, S3, DynamoDB, RDS, S3, Cloudfront.

throwaway8754874y ago

Our RDS instances have completely packed up. Hell knows what's going on. Here come the customer support tickets.

anonu4y ago

Better polish off your BCP docs. People will be asking for them quite a bit more in the new year.

sprite4y ago

My app running on AWS is currently down. Having intermittent problems with console as well.

dugmartin4y ago

I'm getting a plain "504 Gateway Time-out" page when trying to access anything past the console homepage in us-east-1.

stevehawk4y ago

also having console issues in us-east-1, bitbucket is randomly throwing bad gateways at me

streamofdigits4y ago

Somebody call the IT department

allocate4y ago

Also running a big production app in east-1 and we're experiencing issues.

sprite4y ago

I'm also in east-1 and completely down.

throwaway815234y ago

Ok, enough AWS outages to say I'm tired of hearing about low end stuff being flaky.

BiteCode_dev4y ago

"Don't use a self hosted monolith, it's not reliable! You need a cloud FS with a load balancer under observability and your data in a db that scales horizontally, all orchestrated by kubs."

Meanwhile, I currently have a gig to work on a video service which features a never updated centos 6, an unsupported python 2 blob website, and a push to prod deployment procedure, running a single postgres db serving streaming for 4 million users a month.

And it's got years of up time, costs 1/100th of AWS, and can be maintained by one dev.

Not saying "cloud is bad", but we got to stop screaming old techs are no good either.

osrec4y ago

Purely out of interest, I'd like to know more about your streaming architecture. I assume postgres just holds the meta data, and the actual video content is stored elsewhere? What strategies have you employed to scale the streaming part of your service? I imagine 4 million users a month is quite a significant amount of traffic!

BiteCode_dev4y ago

1 - For the last 10 years, servers have been beasts. You have a lot of cores, plenty of HD and RAM. Servers are less expensive than devs. Scaling vertically can go VERY far.

2 - Caching is life. We have 3 layers of caching: cloudflare, varnish, and redis. Most things don't need to be real time. A lot of things can be a month old and the user doesn't care. Users need immediate feedback to be happy, but not necessarily fresh data.

3 - if you compile nginx manually, you get to use a lot of plugins that can do stuff super fast, including serving videos. You can script stuff in lua that will just skip the backend completely.

4 - mind your encoding. We carefully chose how we encode videos. The ffmpeg parameters are pretty insane, but the space / quality ratio is amazing, especially on mobile. It takes a lot of time to experiment with those, nobody shares them :)

5 - we offload everything we can to cron tasks or task queues. Including, obviously, encoding, screenshooting, etc.

6 - don't hold data you can't lose. E.g. billing. This way you can have a relaxed attitude toward data. If we ever lose a day of business, users will be in a bad mood for a week, but that won't be the end of the world. We don't need a bullet proof system if bullets can't kill us.

7 - give money to ffmpeg and opencv, because damn those things are fast. And good.

8 - servers are hosted across 2 providers. This way, if one goes down, or decides to stop doing business with us Google style, we have a second one. Happened recently with leaseweb: they shut down a whole room without offering an alternative.

E.G: votes.

They don't hit the backend on write. We pile them from nginx to redis, then once a day, we aggregate and store on postgres, which the backends will consume. We just store each vote on localstorage as well so that the user feels like it's real time when they vote, but in reality it's updated once a day. But votes don't affect the money side of our business, so if we lose them one day, it does not mean death.

P.S: yes, postgres/redis/elasticsearch only hold metadata. Videos are stored on disk. There are no docker images, no microservices, FS is ext4. Which means with a lot of RAM, the OS FS cache will have most popular videos already loaded and ready to be streamed. Everything is raid 0, so if we get one disk corrupted, we lose the server. But we upload each video on several servers, so when a disk gets corrupted, we just replace the whole server. In fact, if anything goes wrong on a server, we replace it. It's not worth it to find the root cause, unless 2 servers die in the same way successively.
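
The vote pipeline described above (buffer writes somewhere cheap, aggregate once a day, then store the totals) can be sketched roughly as follows. This is an illustration, not the actual code: a Python list stands in for Redis, a dict stands in for the Postgres table, and every name here is hypothetical.

```python
from collections import Counter


class VoteBuffer:
    """In-memory stand-in for the fast write buffer (Redis in the comment)."""

    def __init__(self):
        self._pending = []  # raw (video_id, delta) events, appended on every vote

    def record(self, video_id, delta):
        # Cheap append only; nothing is aggregated or persisted at write time.
        self._pending.append((video_id, delta))

    def drain(self):
        # Hand back everything buffered so far and start a fresh buffer.
        events, self._pending = self._pending, []
        return events


def aggregate(events):
    """Collapse raw vote events into one net delta per video."""
    totals = Counter()
    for video_id, delta in events:
        totals[video_id] += delta
    return totals


def nightly_flush(buffer, store):
    """Once-a-day job: apply the aggregated deltas to the durable store."""
    for video_id, delta in aggregate(buffer.drain()).items():
        store[video_id] = store.get(video_id, 0) + delta


# Usage: three votes arrive during the day, one batch write happens at night.
buffer = VoteBuffer()
store = {"cats": 10}  # stand-in for the persisted vote counts
buffer.record("cats", 1)
buffer.record("cats", 1)
buffer.record("dogs", -1)
nightly_flush(buffer, store)
print(store)
```

The trade-off is the one the comment names: anything still sitting in the buffer is lost if the buffer dies before the flush, which is only acceptable because votes are not business-critical data.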

henriquez4y ago

Heroku isn’t “low end,” it’s a PaaS built on top of AWS. So you’re really just hearing about another AWS outage lol

christophilus4y ago

They're not saying Heroku is low end. They're saying, "I'm tired of hearing that it's irresponsible to run your own servers."

At least, that's what I understood.

ryanbrunner4y ago

Any place I've worked at that managed their own servers (to be fair, the last time I worked at a place like that was 2010) definitely had more protracted downtimes than AWS - it just felt not as bad because we were in control of the situation, but at the end of the day that didn't get us up any faster.

Another side benefit of being with AWS is when you do have an outage, a lot of other people have outages, and so you sort of blend in with the noise. It's not great to be down, but if you're down and also "big service X" who's also an AWS customer is down, it makes your downtime look less like a lack of competence and more like an unavoidable force of nature.

mijoharas4y ago

This comment doesn't say anything about heroku?

jacob0194y ago

Right. I've had an excellent experience with Vultr for the last couple years, for about 1/10th the cost of AWS. I use other small VPS providers as well. I run my own small business and I need to keep costs down to stay competitive. I used to use AWS more but the bill always creeps up to inappropriate levels. AWS billing is insulting, oh you forgot to renew your reserved instance? That's going to be double this month. I still use cloudfront, route 53, and a few of the smallest instances for mail servers and asterisk though. It's foolish to go all in with AWS, or with anything really.

api4y ago

Nobody ever got fired for using AWS.

alecbz4y ago

I wonder to what extent this actually becomes less of a problem the more people use AWS. At this point AWS being down just feels like "the internet is down", it's hard for customers to be too mad at any company being down when all their competitors are too.

Though I guess there's still probably just lost revenue that could be captured by having better uptime, even if your competitors are down.

AQuantized4y ago

This seems like an interesting pendulum swing where the few companies not reliant on AWS could capture significant enough revenue by maintaining uptime during a potential busy season outage.

trabant004y ago

True sad fact. I first thought it was a management problem but lately I see it is the tech bros who push for fads in the hopes of staying relevant and not assuming responsibility for choices.

Jupe4y ago

(Accidentally down-voted, apologies! I would upvote to fix, but can't... Update: fixed)

Agreed. Arguably, not using an existing cloud service is a red flag on any new hires. AWS being the primary, but experience using GCS or Azure are at least viable skills, even if your business is AWS-based.

But the "fad-based-development" meme is not going away any time soon. The incentives in the business are built around it (really! No one wants to work on a boring old relational database solution any more). In the old days it was 4th generation languages, RUP, XML and Function Point Analysis... today it's functional programming, SDKs, big-three cloud PaaS experience or (shudder) block-chain.

I think back to my much younger self, when I thought that technology was something to be mastered to solve real-world problems, and I laugh. Little did I know the real problem to be solved was to figure out how to solve those same-old business problems but with the technology of the season (Kubernetes, GraphQL or ML).

datavirtue4y ago

Omg, this needs to be on a plaque or something.

"Let's move our internal app with 50 users to k8s in the cloud." --true story

debarshri4y ago

Today DO also went down. We briefly could not log in.

jeremyjh4y ago

Just the control panel or were your instances down as well?

pxue4y ago

maybe except a team at google? ;)

falcolas4y ago

I have a story from only a few years ago where the finance section, and a good portion of management, of Google had no idea how poor their GAE solution was for uptime, until they tried to do business critical work using software that was hosted on GAE.

Uptime improved rather dramatically after that.

flatiron4y ago

If you rely solely on east 1 maybe?

omh24y ago

AWS doesn't follow their own advice about hosting multi-regional.

When us-east-1 is sufficiently borked the management API and IAM services in all regions tend to go down with it.

Static infrastructures usually avoid the fallout, but anyone dependent on the API or otherwise dynamically created resources often get caught in the blast regardless of region

bognition4y ago

What a way to start my day

300bps4y ago

Can we please stop saying, “AWS is down”?

AWS consists of over 200 services offered in 86 availability zones in 26 regions each with their own availability.

If one service in one availability zone being impaired equals a post about “AWS is down” we might as well auto-post that every day.

omh24y ago

AWS doesn't follow their own advice about hosting multi-regional so every time us-east-1 has significant issues pretty much every AZ and region is affected.

Specifically large parts of the management API, and IAM service are seemingly centrally hosted in us-east-1.

If your infrastructure is static you'll largely avoid the fallout, but if you rely on API calls or dynamically created resources you can get caught in the blast regardless of region

satya714y ago

Seems enough services in us-east-1 are down to cause most apps to fail. My simple app uses 10s of AWS services, at least some of which are out.

300bps4y ago

I may have seen more of these posts than you. The last one I saw where “AWS is down” was us-west-1.

KptMarchewa4y ago

Would be cool if this wasn't the region where AWS hosts their internals, making other regions unusable, right?

sawmurai4y ago

It's like my grandma saying "Honey, the internet is broken again." xD

biznickman4y ago

Why isn't Heroku showing a status error despite being offline?

mikece4y ago

Because it's built on AWS and uses the AWS status page for its status info?

sreitshamer4y ago

Console is sluggish for me, but S3 (us-east-1) seems to work fine.

ChrisMarshallNY4y ago

I can't play Borderlands 3 this morning (Epic).

Wonder if it's connected?

13daug4y ago

This S3 how you gonna get you investment back from it

networkisfine4y ago

Isn't the point of the design of an availability zone having multiple data centers so that if a single data center in the availability zone fails, services aren't affected?

Demcox4y ago

Imgur is suffering from this too, I think.

amai4y ago

A problem with log4j/Log4Shell?

whoomp123424y ago

the cloud is great they said...

tomerbd4y ago

Rumble was up all this time.

reactive554y ago

Bitbucket is down as well

exabrial4y ago

Stat That.

quantumfissure4y ago

Me: Hesitation at last job moving absolutely everything (including backups) to AWS because if it goes down it's a problem. I'm a firm believer in some kind of physical/easily accessible backup.

Coworkers: "You're an f'n idiot. Amazon and Facebook don't go down, you're holding us back!" <-Quite literally their words.

Me: leaves cause that treatment was the final straw

Amazon and Facebook both go down within a month of each other, and supposedly they needed backups

Them: shocked pikachu face

kalleth4y ago

I'd be surprised if they needed backups for a few hours of downtime with (reportedly) complete recovery where no data was corrupted. There are industries where this would be required, and it's possible I guess, but neither of these downtime events were "data loss" events, just availability events for short-ish periods of time that wouldn't - for me - result in activating our DR plans.

I must admit that I do always try and maintain a separate data backup for true disaster recovery scenarios - but those are mainly focused around AWS locking me out of our AWS account (and hence we can't access our data or backups) or recovering from a crypto scam hack that also corrupts on-platform backups, for example.

aeonflux4y ago

I once had to argue that we still do need backups even though S3 has redundancy. They laughed when I mentioned a possible lock-out from AWS (even due to a mistake or whatever). I asked what if we delete data from the app by mistake? They told me we need to be careful not to do that. I guess I am getting more and more tired of arrogant 25-year-old programmers with 1-2 years in the industry and no experience.

hinkley4y ago

AWS has had at least one documented incident where a region had an S3 failure that was not recoverable. They lost about 2% of all data. That might not sound like much but if you have a lot of data, partial restoration of that data doesn't necessarily leave your system in a functional state. If it loses my compiled CSS files I might be able to redeploy my app to fix it. Then again if I'm a SaaS company and that file was generated in part from user input, it might be more difficult to reconstruct that data.

numbsafari4y ago

Today's gentle reminder that there are things other than network or service outages that can and do occur that might necessitate an outside backup.

What happens if AWS or [insert other megacloud] decides your account needs to be nuked from orbit due to a hack or some other confusion? We almost had this happen over the summer because of a problem with our bank's ability to process ACH payments. Very frustrating experience. Still isn't fully resolved.

What happens if an admin account is taken over and your account gets screwed up?

What happens if an admin loses his shit and blows up your account?

What happens if your software has a bug that destroys a bunch of your data or fubars your account?

There's a ton of cases where having at least a simple replica of your S3 buckets into a third-party cloud could prove highly valuable.

btown4y ago

Would you be able to expand at all about the ACH/AWS connection, obviously without identifying details?

Was it just a miscommunication around AWS billing and them thinking you weren't paying? Or did AWS somehow put itself in the middle of, or react to, your use of ACH payment processing for *non-AWS* receivables or payables?

If the latter, that's a business risk I'd never even thought about. I'm not even sure how they'd know. But I'm thoughtful that things like the MATCH list [0] exist, and how easily a merchant can accidentally wind up on these lists from either human error or a small amount of high-value chargebacks. If cloud providers are somehow paying attention to merchant services reputation, that would be very scary for many businesses!

[0] https://www.merchantmaverick.com/learning-terminated-merchan...

hinkley4y ago

I would make a friendly wager that AWS user IDs don't contain check digits, let alone bullet proof ones (simple check digits don't guard against transposition errors). And that somewhere, someone can manually enter an account to delete, and that one of us will eventually have an account numbered XXX1234 and some idiot with account XXX1243 will legitimately earn an account deletion, but we'll be the ones who wake up to bad news.
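
To make the transposition point concrete, here is a small sketch comparing a naive sum-based check digit with the Luhn algorithm on the two account numbers above. Whether AWS account IDs carry any check digit is not something this demonstrates; the function names are illustrative.

```python
def sum_check_digit(digits: str) -> int:
    """Naive check digit: sum of all digits modulo 10 (order-insensitive)."""
    return sum(int(d) for d in digits) % 10


def luhn_check_digit(digits: str) -> int:
    """Luhn check digit for a payload that does not yet include the check digit."""
    total = 0
    # Walk right to left, doubling every second digit starting with the rightmost.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the doubled value
        total += d
    return (10 - total % 10) % 10


for account in ("1234", "1243"):
    print(account, "sum:", sum_check_digit(account), "luhn:", luhn_check_digit(account))
```

The naive digit is identical for 1234 and 1243, so the swap goes unnoticed, while Luhn assigns them different check digits. Luhn is not bulletproof either: it cannot tell 09 from 90.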

lmilcin4y ago

Think about it this way:

1) Can you make your on prem infrastructure go down less than Amazon's?

2) Is it worth it?

In my experience most people grossly underestimate how expensive it is to create reliable infrastructure and at the same time overestimate how important it is for their services to run uninterrupted.

EDIT: I am not arguing you shouldn't build your more reliable infrastructure. AWS is just a point on a spectrum of possible compromises between cost and reliability. It might not be right for you. If it is too expensive -- go for cheaper options with less reliability.

If it is too unreliable -- go build your own yourself, but make sure you are not making a huge mistake because you may not understand what it actually costs to build to AWS's level.

For example, personally, not having to focus on infra reliability makes it possible for me to focus on other things that are more important to my company. Do I care about outages? Of course I do, but I understand doing this better than AWS does would cost me a huge amount of focus on something that is not a core goal of what we are doing. I would rather spend that time thinking how to hire/retain better people and how to make my product better.

And adding all that complexity of running this infra to my company would cause the entire organisation to be less flexible, which is also a cost.

So you can't look at cost of running the infra like a bill of materials for parts and services.

And if there is an outage it is good to know there is a huge organisation there trying to fix it while my small organisation can focus on preparing for what to do when it comes back up.

patentatt4y ago

On the other hand, perhaps the large cloud providers bring a level of complexity that outweighs their skills at keeping everything up. What I mean is, a basic redundancy and failover setup with two data centers is kind of straightforward. Sure you need a person on call 24/7 to oversee it, but it's conceptually not that complicated. And if you're running bare metal, you get a surprising level of performance per dollar and rack unit. On the other hand, the big clouds are immensely complex with multiple layers of software defined networking, millions of tenants, thousands of employees, acres of floor space, org charts, etc. If you're running your own infra as one competent sysadmin, you know nobody else in another department will push a networking code change that will break your shit in the middle of the night. Maybe it's not right for everyone, but it's not unreasonable to go on prem in 2021 despite the popular opinions otherwise. Source: my company runs on prem and routinely has 100% uptime years. Most unplanned downtime occurs early on a Sunday morning following a planned action during a maintenance window.

Nextgrid4y ago

> Can you make your on prem infrastructure go down less than Amazon's?

Obviously depends on what you need, but for a small to medium web app that needs a load-balancer, a few app servers, a database and a cache, yes absolutely - all of these have been solved problems for over a decade and aren't rocket science to install & maintain.

> Is it worth it?

I'd argue that the "worth" would be less about immunity to occasional outages but the continuous savings when it comes to price per performance & not having to pay for bandwidth.

> overestimate how important it is for their services to run uninterrupted.

Agreed. However when running on-prem, should your service go down and you need it back up, you can do something about it. With the cloud, you have no choice but to wait.

laumars4y ago

I have run high availability (HA) systems on prem and your statement vastly understates the difficulty and expense.

You need multiple physical links running to different ISPs because builders working on properties further down the street could accidentally cut through your fibre. Or the ISP themselves could suffer an outage.

You need a back up generator and to be a short distance away from a petrol station so you can refuel quickly and regularly when suffering from longer durations of power outages. You absolutely do not want to run out of diesel!

You need redundancy of every piece of hardware AND you need to test that failover works as expected because the last thing you need is a core switch to fail and traffic not to route over the secondary core switch like expected.

You need multiple air con units, and they need to be powered off different mains inputs so if the electrics fail on one unit it doesn’t take out the others. I guarantee you that if the air cons fail, it will be on the hottest day of the year, and no amount of portable units will stop your servers from overheating.

You need a beefy UPS with multiple batteries. Ideally multiple UPSs with each UPS powering a different rail on your racks so that if one UPS fails your hardware is still powered from the other rail. And you need to regularly check the battery status and loads on the UPS. Remember that the back up generator takes a second or two to kick in so you need something to keep the power to the servers and networking hardware uninterrupted. And since all your hardware is powered via the UPS, if that dies you still lose power even if the building is powered.

And you then need to duplicate all of the above in a second location just in case the first location still goes down.

By the way, all of the possible failure points I’ve raised above HAVE failed on me when managing HA on prem.

The reason people move to the cloud for HA is because rolling your own is like rolling your own encryption: it’s hard, error prone, expensive, and even when you have the right people on the team there’s still a good chance you’ll fuck it up. AWS, for all its faults, does make this side of the job easier.

badams25274y ago

Human capital side would disagree with that I think. You're assuming the organization which owns this small/medium web app has the personnel already on staff to handle such a thing.

If you're outsourcing that, you'd likely have to pay a boatload just for someone to be available for help, let alone the actual tasks themselves. Like you said, if you're on-prem and something goes down, you can do something. But you've gotta have the personnel to actually do something.

That said, I think you're spot-on as long as you have the skillset already.

jerf4y ago

I think if you put a bit of effort into classifying importance, you can likely justify backing up certain critical systems in more than one way. Let "the cloud" handle everyone's desktop backups and all the ancillary systems you don't really need immediately to do business, but certain important systems should perhaps be backed up both to the cloud and locally, like Windows Domain Controllers and other things you can't do anything without.

Backup is cheap when you're focused about what you're backing up.

In this case, the game isn't "going down less than Amazon", it's about going down uncorrelated to Amazon. Though that's getting harder!

"In more than one way" doesn't have to be local, but it may be across multiple cloud services. Still, "local" is nice in that it doesn't require the Internet. ("The Internet" doesn't tend to go down, but the portion you are on certainly can.) Of course, as workers disperse, "local" means less and less nowadays.

Retric4y ago

It really depends on how reliable you need to be. Don’t forget you get downtime from both AWS and your own issues, so even 4 9’s is off the table with pure AWS. If you need to be more reliable than AWS you need to run a hybrid inside and outside of AWS, which means most of the advantages of running on AWS go away.
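A back-of-envelope sketch of that arithmetic, assuming your own failures and the provider's are independent and that either one takes you down. Both availability figures below are made up for illustration; they are not measured or published AWS numbers.

```python
# Illustrative numbers only; neither figure is a measured or published value.
MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(*parts: float) -> float:
    """Availability of a chain where every part must be up.

    Assumes the parts fail independently of each other.
    """
    total = 1.0
    for availability in parts:
        total *= availability
    return total


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per (non-leap) year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


provider = 0.9995   # hypothetical availability of the cloud provider
own_stack = 0.9995  # hypothetical availability of your own code and ops

combined = serial_availability(provider, own_stack)
print(f"combined availability: {combined:.6f}")
print(f"expected downtime:     {downtime_minutes_per_year(combined):.0f} min/year")
print(f"four nines budget:     {downtime_minutes_per_year(0.9999):.0f} min/year")
```

The combined figure is always lower than either input, which is why stacking your own outages on top of the provider's pushes four nines out of reach under these assumptions.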

pkulak4y ago

> Can you make your on prem infrastructure go down less than Amazon's?

Over the last two years, my track record has destroyed AWS. I've got a single Mac Mini with two VMs on it, plugged in to a UPS with enough power to keep it running for about three hours. It's never had a second of unplanned downtime.

About 15 years ago I got sick of maintaining my own stuff. I stopped building Linux desktops and bought an Apple laptop. I moved my email, calendars, contacts, chat, photos, etc, to Google. But lately I've swung 180 degrees and have been undoing all those decisions. It's not as much of a PITA as I remember. Maybe I'm better at it now? Or maybe it will become a PITA and I'll swing right back.

EDIT: I realize you're talking in a commercial sense and I'm talking about a homelab sense. Still, take my anecdote for what it's worth. :D

woodruffw4y ago

Not my company, but I work with another company that does (nearly?) all of their infrastructure on premise. They have pretty great uptime, in a large part because they're not dependent on the 3-4 global state mechanisms that consistently cause outages with cloud providers (DNS, BGP, AWS's role management/control plane, &c.).

I think you're right about what we over- & under-estimate, but that we also under-estimate the inflection point for when it makes sense to begin relying on major cloud services. Put another way: we over-estimate our requirements, causing us to pessimistically reach for services that have problems that we'd otherwise never have.

autosharp4y ago

Also, you can just take two different amazon regions and hope they don't both go down at the same time.

For extra safety, and extra work, you could even take Azure as a backup if you're not locked in with AWS.
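How much the second region buys you depends on how independent the two really are. A rough sketch, where every probability is a hypothetical input and `shared` stands for any common-cause failure (for example a shared control plane) that takes both down together:

```python
def both_down_probability(unavail_a: float, unavail_b: float, shared: float = 0.0) -> float:
    """Probability that both sites are down at the same moment.

    unavail_a, unavail_b: per-site unavailability from causes local to that site.
    shared: probability of a common-cause failure that takes both sites down.
    Local causes are assumed independent of each other and of the shared cause.
    """
    independent = unavail_a * unavail_b
    return shared + (1.0 - shared) * independent


# Hypothetical inputs, not real figures for any provider.
fully_independent = both_down_probability(0.001, 0.001)
with_shared_cause = both_down_probability(0.001, 0.001, shared=0.0005)

print(f"independent regions: {fully_independent:.2e}")
print(f"with shared cause:   {with_shared_cause:.2e}")
```

With truly independent regions the joint outage probability is the product of two small numbers. Once there is any shared dependency, that shared term dominates the result.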

dijit4y ago

forgive me repeating myself: AWS Zones are not truly independent of each other.

Global services such as Route 53, Cognito, the default cloud console and CloudFront are managed out of us-east-1.

If us-east-1 is unavailable, as is commonly the case, and you depend on those systems, you are also down.

it does not matter if you're in timbuktu-1, you are dead in the water.

it is a myth that amazon availability zones are truly independent.

please stop blaming the victim, because you can do everything right and still fail if you are not aware of this; and you are perpetuating that unawareness.

dgudkov4y ago

1) Can you make your on prem infrastructure go down less than Amazon's?

It's now hard to say how frequently Amazon's infrastructure goes down. The incident rate seems to have accelerated.

ocdtrekkie4y ago

My on prem infrastructure goes down drastically less than Amazon's.

...My home Internet even is scoring better than Amazon right now, in fact. Yours probably is too.

StreamBright4y ago

3) Could you hire talent that can build the thing?

In my experience problem number 3 is the hardest to solve.

jtc3314y ago

You’re missing a huge factor: agency.

fatnoah4y ago

My last startup migrated from Verizon Terremark after the healthcare.gov fiasco several years ago. We also suffered from that massive outage and that was the final straw in migrating to AWS.

At AWS, we built a few layers of redundant infrastructure with multi-AZ availability within a region and then global availability across multiple regions. All this was done at roughly half the cost of the traditional hosting, even when including the additional person-hours required to maintain it on our end.

Keeping our infra simple helped that work, and it's literally been years since an outage caused by any AWS issues, even though there have been several large AWS events.

hinkley4y ago

Every time one of these conversations happens I end up thinking to myself that Oxide Computing needs three more competitors and a big pile of money.

AWS maintains a fiction of turnkey infrastructure, and the reality of building your own is so starkly different that I haven't seen an IT group for some time that could successfully push back on these sorts of discussions.

Building your own datacenter is still too much like maintaining a muscle car, fiddly bits and grease under your fingernails all the time, meanwhile the world has moved on, and we now have several options in soccer mom EVs that can challenge a classic Corvette in the quarter mile, and obliterate its 0-60-0 time. There is no Hyundai for the operations people, and there should be.

I don't know the physics of shipping such a thing, but I think we really do need to be able to buy a populated and pre-wired rack and slot it into the data center. Literally slot it in. If you've ever been curious about maritime shipping, you know that they have a system for securing containers to cranes, trailers, each other, and I don't see a reason you couldn't steal that same design for mounting a server rack to the floor. Other than the pins would need to be removable (eg, a bolt that screws into a threaded hole in the floor) so you don't trip on them.

In a word, we need to make physical servers fungible. There are any number of things that we need to do to get there, but I think we can. Honestly I'm surprised we haven't heard more of this sort of talk from Dell, especially after they bought VMWare. This just seems like a huge failure of imagination. Or maybe it's simply a revolution lacking a poster child. At this rate that 'child' has already been born, and we are just waiting to see who it is.

zymhan4y ago

Indeed, if you only deploy resources in us-east-1, or any other single region, you're risking the occasional downtime.

I'd wager that will still give you more uptime than a physically-hosted solution for the same cost.

uvdn74y ago

You could have just shown them historical data of both companies being unavailable for extended periods of time. What happened in the past few months is not new.

joana0354y ago

"just", as if you never had to argue against aws fanboys...

whydoyoucare4y ago

It reminds me of the old adage: "Two is one, one is none. Have a backup. Always."

kburman4y ago

AWS or Google or any other reputable cloud provider are still far better options than your local backup. The only way I see you losing your data is your account getting locked.

davewritescode4y ago

You’re not wrong but there’s ways to do backups properly in AWS and I’m not aware of there ever being an incident where AWS has lost data.

It’s not a bad idea to store backups offline but costs might make that an expensive proposition.

numbsafari4y ago

S3 isn't perfect. Read the fine print.

I've had buckets and objects disappear into the ether.

It is exceedingly rare, but it's not impossible.

Offline/alt-cloud backups are probably a lot cheaper than you think, and will win you points during any audit.

dookahku4y ago

Send your former colleagues a group email asking how it is

jmartrican4y ago

Seems like multi-cloud solution might be the way to go.

thedougd4y ago

I doubt it. The complexity of multi-cloud will also give you downtime.

Most of the folks impacted by cloud outages do not have highly available systems in place. Perhaps, for their business, the cost doesn't justify the outcome.

If you need high uptime for instances, build your system to be highly available and leverage the fault domain constructs your provider offers (placement groups, availability zones, regions, load balancing, DNS routing, autoscaling groups, service discovery, etc). For instances, double down and use spot instances and maximum lifetimes in your groups so that you're continuously validating that your application can recover from instance interruptions.

If you're heavy on applications that leverage cloud APIs, such as is often the case with Lambdas, then strongly consider multi-region active/active as API outages tend to cross AZs and impact the entire region.
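A minimal sketch of the caller's side of that, assuming both regions are already serving traffic and the client simply tries the next region when one fails. `fake_api` is a hypothetical stand-in for a regional API call, not a real SDK function, and the region list is only an example.

```python
from typing import Callable, Sequence


def call_with_failover(regions: Sequence[str], call: Callable[[str], str]) -> str:
    """Try each region in order and return the first successful response."""
    errors = []
    for region in regions:
        try:
            return call(region)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{region}: {exc}")
    raise RuntimeError("all regions failed: " + "; ".join(errors))


def fake_api(region: str) -> str:
    # Hypothetical stand-in for a regional API call; us-east-1 is simulated as down.
    if region == "us-east-1":
        raise ConnectionError("simulated regional API outage")
    return f"ok from {region}"


print(call_with_failover(["us-east-1", "us-west-2"], fake_api))
```

This only covers request routing. Keeping data consistent across the two regions is the harder part and is not shown here.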

nier4y ago

All while making sure that these cloud solutions are not inter-dependent and that there are redundant paths to access these services.

hinkley4y ago

Have you contacted them to see how things are going?

Maybe a cheery note asking how the team is doing, sent right in the middle of an outage.

Passive aggressive? As hell. Cathartic? Damn skippy.

mattl4y ago

Backup to rsync.net

rafale4y ago

Did u file a complaint on the use of swear words?

xwdv4y ago

You’re still in the wrong, don’t be so smug. These few downtimes are no big deal in the grand scheme of things, and your proposed solution would have been more work and headaches for little to no realizable gains, and not to mention the cybersecurity ramifications. Quite frankly, they are probably glad that you’re gone and not around to gloat about every trivial bit of downtime.

locallost4y ago

They're not gloating and also not smug. There's not even a 'hehe' in the post.

CaptRon4y ago

At least HN works.

sydthrowaway4y ago

Switch to Azure

clavicat4y ago

How much more frequent do these outages need to become before it starts triggering SLA limits?
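For a sense of scale, here is a rough sketch of how little downtime a monthly uptime percentage allows. The percentages are hypothetical examples; a real SLA defines how unavailability is measured and what counts toward it, so check the actual terms.

```python
def monthly_downtime_budget_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime allowed in a month before the uptime percentage is missed."""
    return (1.0 - sla_percent / 100.0) * days_in_month * 24 * 60


def breaches_sla(outage_minutes: float, sla_percent: float, days_in_month: int = 30) -> bool:
    """True if the total outage time in the month exceeds the budget."""
    return outage_minutes > monthly_downtime_budget_minutes(sla_percent, days_in_month)


# Hypothetical SLA levels, for illustration only.
for sla in (99.0, 99.9, 99.99):
    budget = monthly_downtime_budget_minutes(sla)
    print(f"{sla}% uptime allows {budget:.2f} minutes of downtime per 30-day month")
```

Under this simple model a single multi-hour outage is enough to blow through any of these budgets, so the frequency matters less than the length of each incident.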

sh4un4y ago

Damn you all eggs in one basket.
