> Over the course of the outage, we continued to monitor our service status, and worked with some of the affected users to narrow down the source of the problem.
> On July 14th, at 19:14 UTC we were able to identify that the problem was within our us-west3 region, which we then took offline, directing traffic to other nearby regions instead.
The time difference between when the first reports came in and when it was confirmed is a little concerning.
As an aside:
> ... approximately 18:00 UTC ...
> ... just over 24 hours ...
> ... For a period of around 24 hours, some users in the us-west3 region
> ... less than 30 minutes ...
> ... On July 13th, at around 18:45 UTC we started to receive reports of an outage from a small number of users. ...
"Approximately", "just", "around", "some", "small number of". It goes on and on. I disagree with the stylistic approach of being less specific in posts like these. A "small number of users" is relative. As readers, we have no idea what your typical load may be. Small may be a large number to us. "Just" over 24 hours is 26 hours? 24.5 hours? I implore you to be specific when you have the actual data.
These terms read as weasel words, and impact your effort at being fully transparent.
That is all to say, regardless of whether the post itself is “a little concerning,” it would be more concerning if the post didn’t even exist. And if you weren’t one of their affected customers, you likely wouldn’t even know this happened. So they did the right thing by publishing it and opening themselves to your criticism, which is a positive sign for the future of the platform IMO.
Our services are of what I consider medium complexity (~70 services, ~10 different "layers" of logic: db, caching, load balancing, etc., on AWS, with mostly self-managed centralized logging and monitoring) but still quite low-volume (< 100 requests/second). Any more serious issue (let alone an outage) is spontaneously treated by my team as an absolute emergency and typically fixed in < 10 minutes.
We're very modestly funded compared to Deno (in this example) and the team is small...
Not sure whether that changes with traffic volume, complexity, team size, or is more primarily attitude-based and should continue to be cultivated.
After we managed to successfully isolate the issue we were able to disable the region within 30 minutes, because we had an established protocol for how to do that.
Here is a more typical incident update for us: https://deno.com/blog/2022-05-30-outage-post-mortem
Part of the issue was also that we did not realize the scope of the issue right at the start of the incident, because our automated monitoring did not catch the dropped traffic.
All that is to say: the outage is obviously unacceptable, and we sincerely apologize for it. We are working very hard to make sure nothing similar can occur again in the future.
As mentioned, I'm looking forward to continuing to follow Deno's progress and all the best in hardening your devops!
Unless they had "route the whole region over another one" in their prepared and practiced DR procedures, it would take any team significant time to get that planned, approved, implemented, and tested.
If you're running something at tens of services scale and recovered in 10min, you're extremely lucky. I'd suggest that if you don't have risks on your list that will take hours to resolve, your list is not complete.
One alleviating circumstance is that, running on AWS, a big portion of such issues (the ones that would take a lot of time to resolve) would come from wider AWS outages, where there's significant leeway: the old adage that our customers (and a big part of the web) would have bigger problems than our being down if an entire AWS region (or multiple) went down.
In Deno's case, most of "those" parts are self-managed and surely much more difficult to keep running reliably.
In our case, 1 out of 5 LB instances lost its connection to service discovery and later ended up not knowing about a failover of one of the 5 backends for a service. As a result, something like 1 in 20 to 1 in 25 requests got answered with a connection refused. That took a minute to find.
- The load balancer lost its connection to etcd and did not reconnect
- The load balancer had no healthy backend and did not un-advertise itself
- The load balancer did not report either of those issues to monitoring
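All three failures above are control-loop gaps. A minimal sketch of what the missing loop might look like (names like `Discovery` and `Reconcile` are illustrative, not Deno's actual API; the stub stands in for a real etcd watch):

```go
package main

import "errors"

// Discovery abstracts the service-discovery client (etcd in the post).
// All names here are illustrative, not Deno's actual API.
type Discovery interface {
	HealthyBackends() ([]string, error)
}

// stubDiscovery is a test double standing in for a real etcd watch.
type stubDiscovery struct {
	backends []string
	fail     bool
}

func (s stubDiscovery) HealthyBackends() ([]string, error) {
	if s.fail {
		return nil, errors.New("connection refused")
	}
	return s.backends, nil
}

// LBState tracks exactly what the failed balancer apparently did not:
// whether it should stay advertised, and alerts raised along the way.
type LBState struct {
	Advertised bool
	Alerts     []string
}

// Reconcile is one iteration of a control loop. On a discovery failure
// it alerts (rather than silently running on stale state forever); with
// zero healthy backends it un-advertises instead of black-holing traffic.
func (s *LBState) Reconcile(d Discovery) {
	backends, err := d.HealthyBackends()
	if err != nil {
		s.Alerts = append(s.Alerts, "discovery unreachable: "+err.Error())
		return // caller retries with backoff
	}
	if len(backends) == 0 {
		s.Advertised = false
		s.Alerts = append(s.Alerts, "no healthy backends; un-advertised")
		return
	}
	s.Advertised = true
}
```

The point is that every branch either keeps serving or makes noise; there is no silent path.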
Honestly this is a little concerning. Are they using their own load-balancing software? If yes, why?
To your final question: yes, we are using our own load-balancing software. We are building a global hosting platform that needs to be able to run on bare-metal servers, not an end-user application where load balancing is an afterthought. As such we cannot use much of the software that a "regular" SaaS application may be able to. Some constraints our system needs to satisfy:
- Our load balancers handle routing to 100s of thousands of unique deployments (services), all of which need to be accessible and routable within milliseconds of a request coming in.
- We need to terminate TLS connections for thousands of unique domains.
- We need to be able to carefully control TLS handshakes, so that we can prewarm downstream services for an imminent request to a given deployment based on the SNI in the TLS ClientHello, before we have even received an HTTP request.
- The system needs to handle hundreds of millions of hourly requests.
- The system needs to be able to run on bare metal.
- We currently handle 34 regions globally (up from 28 at the start of the year), which means that all of the data needed to fulfill the above requirements needs to be accessible from all of our PoPs in a matter of milliseconds.
For many companies global load balancing is something they can outsource to AWS, GCP, or Cloudflare. For us, this is core "business logic" that we need to have full control over. It's difficult for us to outsource, and it's questionable whether it would be wise for us to do so. Building new systems is obviously always a complex undertaking, and there will be some stumbling blocks along the way, but they can be overcome. We are still bullish that our path is the right one, even if we still have a lot of work ahead.
(if this seems interesting, and you want to work with us on building load balancers, among other things: https://deno.com/jobs)
And there is no API monitoring apparently.
We are capable of learning from past mistakes though, and as such we'll make sure to add more monitoring for these kinds of scenarios so we can be alerted to a root cause earlier. We will do better.
LBs should also be alerting on health check failures and on "no data" for targets.
Visibility is the cause here, and the lesson learned on duration. It's worth simply paying for third-party distributed RUM. Make sure you can get down to /24s and ASNs, as well as breaking it out by (your) target destination/address. I really liked TurboBytes in the past. Cedexis was OK, but I remember the API/raw data access being a bit of a pain.
It sounds like your TCP LB wasn't exporting metrics this time. For other cases you can get decent data out of the TCP metrics cache on Linux, and /proc has some good counters even before you get a socket; PAWSPASSIVEREJECTED may have bitten me before :( Make sure your reads of /proc/net/netstat are aligned to the right size if you go that route.
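For reference, /proc/net/netstat is alternating header/value line pairs per protocol group ("TcpExt:", "IpExt:"), with counter names paired positionally against the numbers on the following line. A small parser sketch (counter names in the test are just a sample; the full set varies by kernel):

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// parseNetstat parses the /proc/net/netstat format: a header line of
// counter names followed by a line of values, repeated per protocol
// group. Pairing is positional, which is exactly why a misaligned or
// truncated read (the pitfall mentioned above) silently attributes
// counters to the wrong names.
func parseNetstat(contents string) map[string]uint64 {
	counters := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		headers := strings.Fields(sc.Text())
		if !sc.Scan() {
			break // header without a value line; stop rather than misalign
		}
		values := strings.Fields(sc.Text())
		for i := 1; i < len(headers) && i < len(values); i++ {
			if v, err := strconv.ParseUint(values[i], 10, 64); err == nil {
				counters[headers[i]] = v
			}
		}
	}
	return counters
}
```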
> ... because the load balancer that failed was very early in the network stack (a TCP load balancer). It does not record any diagnostics about dropped connections ...
You may be able to sort some improved visibility with something like netflow/sflow. This aligns well with discrete components and independent failure domains as well.
> Services announce themselves to the etcd cluster when their availability state changes ... If there are no healthy backends it will un-advertise itself from the network to prevent requests ending up at this "dead end".
In my experience you really can't rely on nodes to manage themselves when it comes to service availability or health. There are too many grey-failure cases where a data plane node will partially fail enough to keep mangling traffic or passing shallow health checks. E.g. a disk going read-only or stalled IO can keep the LB and active data in memory up, with signalling like BGP sessions staying up, but prevent consuming new system/customer state updates. A separate system/component is necessary so the control loop is insulated from those failures.
You end up in a situation where the distributed LB has "data plane" workers that handle connections & packets while the out-of-band "control plane" determines health & controls BGP/routing/ARP/whatever to put the data plane nodes in or out of service. Your application/LB/etc. data plane can still self-report & retrieve data from etcd, but put the control somewhere with less correlated failures. While you're at it, build data versioning into your configuration (e.g. active customers/domains/etc.) that your data plane uses & reports. That way your control plane can check both the availability/performance and the current working state of the LB data plane's configuration.
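The versioning suggestion reduces to a simple control-plane predicate. A sketch under assumed names (`DataplaneReport`, `inService`, and the version scheme are all illustrative):

```go
package main

// DataplaneReport is what each LB node self-reports out of band: a
// shallow liveness signal plus the version of the config/customer data
// it is actually serving. All names here are illustrative.
type DataplaneReport struct {
	Healthy       bool
	ConfigVersion uint64
}

// inService is the control-plane decision: a node stays in rotation only
// if it is alive *and* current enough. This catches the grey failure
// where a node keeps passing shallow health checks while serving stale
// state (e.g. after silently losing its service-discovery connection).
func inService(r DataplaneReport, latest, maxLag uint64) bool {
	if !r.Healthy {
		return false
	}
	if r.ConfigVersion >= latest {
		return true
	}
	return latest-r.ConfigVersion <= maxLag
}
```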
> [The LB did not have] any healthy backends to direct traffic to. ... This caused the traffic to be dropped entirely.
Throwing a RST or similar here is not wrong per se, and is a nice clear failure mode. One other approach is to have something like a default route that you can punt traffic to (and alert on) as a last resort. Depending on your network/LB configuration this could be a common MAC address, an internal ECMP'd route, or similar. I think you'll see many services that build L3/4 LBs, like CDNs, take this approach. IIRC Google Maglev and Fastly document their take on this for dealing with problems like IP fragments and MTU discovery, where some packets don't flow with the rest of the 5-tuple.
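The punt idea is just a fallback branch in backend selection. A toy sketch (the flow-hash selection and the `punt.internal` destination are assumptions for illustration, not any particular LB's logic):

```go
package main

// pickBackend hashes a flow onto a healthy backend; when none are
// healthy it punts to a last-resort default destination, which should
// also trigger an alert, instead of dropping the traffic outright.
func pickBackend(healthy []string, flowHash uint64, punt string) string {
	if len(healthy) == 0 {
		return punt
	}
	return healthy[flowHash%uint64(len(healthy))]
}
```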
> The region will remain disabled until our monitoring has improved and the issue has been fixed more permanently.
I understand if this choice is about business & customer confidence. However, I didn't see anything that indicated your failure modes were specific to us-west3. It seems that visibility & detection were the real failure, and in that case I'd posit the better path is getting global visibility into your failure mode, deploying that first/early to us-west3, and using that as your gate.
edit: I'm a couple of years past doing distributed networking/LB systems as my full-time job, so apologies if this is dated/fuzzy advice.
I have retroactively added the outage to the status page now: https://denostatus.com/cl5ob2i5s943266vk890ushwov.
'"JUST" over 24 hours', no big deal of course /s
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
no mention of that issue
either nobody uses Deno, so 0 complaints
or people use Deno and for some reason the 24h+ downtime didn't impact anybody, which is surprising, to say the least
> On July 14th, at 19:14 UTC we were able to identify that the problem was within our us-west3 region
So yeah, 24 hours and 29 minutes.