> Over the course of the outage, we continued to monitor our service status, and worked with some of the affected users to narrow down the source of the problem.
> On July 14th, at 19:14 UTC we were able to identify that the problem was within our us-west3 region, which we then took offline, directing traffic to other nearby regions instead.
The time difference between when the first reports came in and when it was confirmed is a little concerning.
As an aside:
> ... approximately 18:00 UTC ...
> ... just over 24 hours ...
> ... For a period of around 24 hours, some users in the us-west3 region
> ... less than 30 minutes ...
> ... On July 13th, at around 18:45 UTC we started to receive reports of an outage from a small number of users. ...
"Approximately", "just", "around", "some", "small number of". It goes on and on. I disagree with the stylistic approach of being less specific in posts like these. A "small number of users" is relative. As readers, we have no idea what your typical load may be. Small may be a large number to us. "Just" over 24 hours is 26 hours? 24.5 hours? I implore you to be specific when you have the actual data.
These terms read as weasel words, and impact your effort at being fully transparent.
That is all to say, regardless of whether the post itself is “a little concerning,” it would be more concerning if the post didn’t even exist. And if you weren’t one of their affected customers, you likely wouldn’t even know this happened. So they did the right thing by publishing it and opening themselves to your criticism, which is a positive sign for the future of the platform IMO.
Our services are of what I consider medium complexity (~70 services, ~10 different "layers" of logic: db, caching, load balancing, etc., on AWS, with mostly self-managed centralized logging and monitoring) but still quite low-volume (< 100 requests/second). Any more serious issue (let alone an outage) is spontaneously treated by my team as an absolute emergency and typically fixed in < 10 minutes.
We're very modestly funded compared to Deno (in this example) and the team is small...
Not sure whether that changes with traffic volume, complexity, team size, or is more primarily attitude-based and should continue to be cultivated.
After we managed to successfully isolate the issue we were able to disable the region within 30 minutes, because we had an established protocol for how to do that.
Here is a more typical incident update for us: https://deno.com/blog/2022-05-30-outage-post-mortem
Part of the issue was also that we did not realize the scope of the issue right at the start of the incident, because our automated monitoring did not catch the dropped traffic.
All that is to say: the outage is obviously unacceptable, and we sincerely apologize for it. We are working very hard to make sure nothing similar can occur again in the future.
As mentioned, I'm looking forward to continuing to follow Deno's progress and all the best in hardening your devops!
Unless they had "route the whole region over another one" in their prepared and practiced DR procedures, it would take any team significant time to get that planned, approved, implemented, and tested.
If you're running something at tens of services scale and recovered in 10min, you're extremely lucky. I'd suggest that if you don't have risks on your list that will take hours to resolve, your list is not complete.
One alleviating circumstance is that, running on AWS, a big portion of such issues (the ones that would take a lot of time to resolve) would come from wider AWS outages, where there's significant leeway: the old adage that our customers (and a big part of the web) would have bigger problems than our being down if an entire AWS region (or multiple) went down.
In Deno's case, most of "those" parts are self-managed and surely much more difficult to keep running reliably.
In our case, 1 out of 5 LB instances lost its connection to service discovery and later ended up not knowing about a failover of one of the 5 backends for a service. As a result, something like 1 in 20 to 1 in 25 requests got answered with a connection refused. That took a minute to find.
- The load balancer lost its connection to etcd and did not reconnect
- The load balancer had no healthy backend and did not un-advertise itself
- The load balancer did not report either of those issues to monitoring
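All three failures above are control-loop gaps. A minimal sketch of what the missing loop might look like (names like `Discovery` and `Reconcile` are illustrative, not Deno's actual API; the stub stands in for a real etcd watch):

```go
package main

import "errors"

// Discovery abstracts the service-discovery client (etcd in the post).
// All names here are illustrative, not Deno's actual API.
type Discovery interface {
	HealthyBackends() ([]string, error)
}

// stubDiscovery is a test double standing in for a real etcd watch.
type stubDiscovery struct {
	backends []string
	fail     bool
}

func (s stubDiscovery) HealthyBackends() ([]string, error) {
	if s.fail {
		return nil, errors.New("connection refused")
	}
	return s.backends, nil
}

// LBState tracks exactly what the failed balancer apparently did not:
// whether it should stay advertised, and alerts raised along the way.
type LBState struct {
	Advertised bool
	Alerts     []string
}

// Reconcile is one iteration of a control loop. On a discovery failure
// it alerts (rather than silently running on stale state forever); with
// zero healthy backends it un-advertises instead of black-holing traffic.
func (s *LBState) Reconcile(d Discovery) {
	backends, err := d.HealthyBackends()
	if err != nil {
		s.Alerts = append(s.Alerts, "discovery unreachable: "+err.Error())
		return // caller retries with backoff
	}
	if len(backends) == 0 {
		s.Advertised = false
		s.Alerts = append(s.Alerts, "no healthy backends; un-advertised")
		return
	}
	s.Advertised = true
}
```

The point is that every branch either keeps serving or makes noise; there is no silent path.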
Honestly this is a little concerning. Are they using their own load-balancing software? If yes, why?
To your final question: yes, we are using our own load-balancing software. We are building a global hosting platform that needs to be able to run on bare-metal servers, not an end-user application where load balancing is an afterthought. As such we cannot use much of the software that a "regular" SaaS application may be able to. Some constraints our system needs to satisfy:
- Our load balancers handle routing to 100s of thousands of unique deployments (services), all of which need to be accessible and routable within milliseconds of a request coming in.
- We need to terminate TLS connections for thousands of unique domains.
- We need to be able to carefully control TLS handshakes, so that we can prewarm downstream services for an imminent request to a given deployment based on the SNI in the TLS ClientHello, before we have even received an HTTP request.
- The system needs to handle hundreds of millions of hourly requests.
- The system needs to be able to run on bare metal.
- We currently handle 34 regions globally (up from 28 at the start of the year), which means that all of the data needed to fulfill the above requirements needs to be accessible from all of our PoPs in a matter of milliseconds.
For many companies global load balancing is something they can outsource to AWS, GCP, or Cloudflare. For us, this is core "business logic" that we need to have full control over. It's difficult for us to outsource, and it's questionable whether it would be wise for us to do so. Building new systems is obviously always a complex undertaking, and there will be some stumbling blocks along the way, but they can be overcome. We are still bullish that our path is the right one, even if we still have a lot of work ahead.
(if this seems interesting, and you want to work with us on building load balancers, among other things: https://deno.com/jobs)
And there is no API monitoring apparently.
We are capable of learning from past mistakes though, and as such we'll make sure to add more monitoring for these kinds of scenarios so we can be alerted to a root cause earlier. We will do better.
LBs should also be alerting on health check failures and on "no data" for targets.
Visibility is the cause here, and the lesson learned on duration. It's worth simply paying for third-party distributed RUM. Make sure you can get down to /24s and ASNs, as well as breaking it out by (your) target destination/address. I really liked TurboBytes in the past. Cedexis was OK, but I remember the API/raw data access being a bit of a pain.
It sounds like your TCP LB wasn't exporting metrics this time. For other cases you can get decent data out of the TCP metrics cache on Linux, and /proc has some good counters even before you get a socket; PAWSPASSIVEREJECTED may have bitten me before :( Make sure your reads of /proc/net/netstat are aligned to the right size if you go that route.
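For reference, /proc/net/netstat is alternating header/value line pairs per protocol group ("TcpExt:", "IpExt:"), with counter names paired positionally against the numbers on the following line. A small parser sketch (counter names in the test are just a sample; the full set varies by kernel):

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// parseNetstat parses the /proc/net/netstat format: a header line of
// counter names followed by a line of values, repeated per protocol
// group. Pairing is positional, which is exactly why a misaligned or
// truncated read (the pitfall mentioned above) silently attributes
// counters to the wrong names.
func parseNetstat(contents string) map[string]uint64 {
	counters := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		headers := strings.Fields(sc.Text())
		if !sc.Scan() {
			break // header without a value line; stop rather than misalign
		}
		values := strings.Fields(sc.Text())
		for i := 1; i < len(headers) && i < len(values); i++ {
			if v, err := strconv.ParseUint(values[i], 10, 64); err == nil {
				counters[headers[i]] = v
			}
		}
	}
	return counters
}
```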
> ... because the load balancer that failed was very early in the network stack (a TCP load balancer). It does not record any diagnostics about dropped connections ...
You may be able to sort some improved visibility with something like netflow/sflow. This aligns well with discrete components and independent failure domains as well.
> Services announce themselves to the etcd cluster when their availability state changes ... If there are no healthy backends it will un-advertise itself from the network to prevent requests ending up at this "dead end".
In my experience you really can't rely on nodes to manage themselves when it comes to service availability or health. There are too many grey-failure cases where a data plane node will partially fail enough to keep mangling traffic or passing shallow health checks. E.g. a disk going read-only or stalled IO can keep the LB and active data in memory up, with signalling like BGP sessions staying up, but prevent consuming new system/customer state updates. A separate system/component is necessary so the control loop is insulated from those failures.
You end up in a situation where the distributed LB has "data plane" workers that handle connections & packets while the out-of-band "control plane" determines health & controls BGP/routing/ARP/whatever to put the data plane nodes in or out of service. Your application/LB/etc. data plane can still self-report & retrieve data from etcd, but put the control somewhere with less correlated failures. While you're at it, build data versioning into your configuration (e.g. active customers/domains/etc.) that your data plane uses & reports. That way your control plane can check both the availability/performance and the current working state of the LB data plane's configuration.
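The versioning suggestion reduces to a simple control-plane predicate. A sketch under assumed names (`DataplaneReport`, `inService`, and the version scheme are all illustrative):

```go
package main

// DataplaneReport is what each LB node self-reports out of band: a
// shallow liveness signal plus the version of the config/customer data
// it is actually serving. All names here are illustrative.
type DataplaneReport struct {
	Healthy       bool
	ConfigVersion uint64
}

// inService is the control-plane decision: a node stays in rotation only
// if it is alive *and* current enough. This catches the grey failure
// where a node keeps passing shallow health checks while serving stale
// state (e.g. after silently losing its service-discovery connection).
func inService(r DataplaneReport, latest, maxLag uint64) bool {
	if !r.Healthy {
		return false
	}
	if r.ConfigVersion >= latest {
		return true
	}
	return latest-r.ConfigVersion <= maxLag
}
```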
> [The LB did not have] any healthy backends to direct traffic to. ... This caused the traffic to be dropped entirely.
Throwing a RST or similar here is not wrong per se, and is a nice clear failure mode. One other approach is to have something like a default route that you can punt traffic to (and alert on) as a last resort. Depending on your network/LB configuration this could be a common MAC address, an internal ECMP'd route, or similar. I think you'll see many services that build L3/4 LBs, like CDNs, take this approach. IIRC Google Maglev and Fastly document their take on this for dealing with problems like IP fragments and MTU discovery, where some packets don't flow with the rest of the 5-tuple.
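The punt idea is just a fallback branch in backend selection. A toy sketch (the flow-hash selection and the `punt.internal` destination are assumptions for illustration, not any particular LB's logic):

```go
package main

// pickBackend hashes a flow onto a healthy backend; when none are
// healthy it punts to a last-resort default destination, which should
// also trigger an alert, instead of dropping the traffic outright.
func pickBackend(healthy []string, flowHash uint64, punt string) string {
	if len(healthy) == 0 {
		return punt
	}
	return healthy[flowHash%uint64(len(healthy))]
}
```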
> The region will remain disabled until our monitoring has improved and the issue has been fixed more permanently.
I understand if this choice is about business & customer confidence. However, I didn't see anything that indicated your failure modes were specific to us-west3. It seems that visibility & detection were the real failure, and in that case I'd posit the better path is getting global visibility into your failure mode, deploying that first/early to us-west3, and using that as your gate.
edit: I'm a couple of years past doing distributed networking/LB systems as my full-time job, so apologies if this is dated/fuzzy advice.
I have retroactively added the outage to the status page now: https://denostatus.com/cl5ob2i5s943266vk890ushwov.
'"JUST" over 24 hours', no big deal of course /s
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
no mention of that issue
either nobody uses Deno, so 0 complaints
or people use Deno and for some reason the 24h+ downtime didn't impact anybody, which is surprising, to say the least
> On July 14th, at 19:14 UTC we were able to identify that the problem was within our us-west3 region
So yeah, 24 hours and 29 minutes.