DNS Outage at DigitalOcean

(status.digitalocean.com)

120 points | finne | 10y ago | 121 comments

tonylemesmer10y ago

People hating on DO: "I'm losing thousands every hour". Well then, you should have had some failover in place if it's that valuable.

[1]https://twitter.com/rodrigoespinosa/status/71303563702097100...

zackelan10y ago

For an example of this taken to a ludicrous extreme, several years ago an AWS user complained that downtime of a few EC2 instances was putting lives at risk. They were hosting cardiac monitoring services on single EC2 instances with no multi-AZ or multi-region capability.

AWS forum post: https://forums.aws.amazon.com/thread.jspa?threadID=65649&tst...

Previous HN discussion: https://news.ycombinator.com/item?id=2477345

crisopolis10y ago

I've been reading all the comments on Twitter too... like "err mai gawd I'm switching to AWS because of this". The failure here is yours for not having a secondary DNS provider, and I highly doubt you'd actually switch.

Then another... "Today's @digitalocean DNS #outage is a reminder to not trust your entire business to one provider. Spread the love around!"

If your company is e-commerce and makes money by being 99.99% available, it's your own fault for having no fail-over.

another... ".@digitalocean that's two hours without DNS now...my company's websites could be losing thousands of £ in e-commerce! Please, an update!"

codegeek10y ago

Times like this make you realize the difference between good clients and bad clients. Yes, they have a right to be upset, but claims like "could be losing thousands of dollars" are mostly exaggerated due to their frustration.

crisopolis10y ago

heh yeah, I just laugh at all the tweets saying they're losing billions of dollars every minute their site/app is unavailable. All I can think is... if you're the next Amazon.com I'm pretty sure you'd have some type of disaster plan in place should something like this happen.

colinbartlett10y ago

I can't disagree with what you're saying, but I think we are all guilty of this. We expect more out of big name services than might be reasonable. (100% uptime)

How many of us here have failover email services in case Gmail goes down? I think many companies would say they'd lose thousands in productivity if Google Apps suffers an outage yet I'd hazard that very few have failover plans.

tracker110y ago

That's because people like to complain... the reality is stuff happens, systems go down, and life tends to go on.

Yeah, if your building full of employees can't work because the internet is down, and the secondary is also down, then that's kind of crappy, and you may be paying people to twiddle their thumbs... much short of that, it's kind of the cost of doing business...

There are redundancy options for a lot of things... If you're using only a single host provider for your infrastructure, and management scoffs at creating redundant, and under-utilized systems... it's not as "mission critical" as people think/say.

IgorPartola10y ago

I use dns.he.net for my DNS hosting. It's free up to 50 zones and has been rock solid. The other day I started having some trouble with accessing some of my domain names. Turned out that all of their DNS was down and was returning NXDOMAIN for pretty much any request, including their own domains. Oops. So I emailed their support (which is usually very quick to respond and is better than I have seen with lots of paid products). Well, it then occurred to me that I will not get a response since the MX records for my domain were also hosted with them. Double oops.

On the plus side, in the past 4 years that I've used them this was the very first issue, and they fixed it within a couple of hours.

Anyone have any good recommendations on cheap or free backup DNS hosting?

nullrouted10y ago

Use www.dnsmadeeasy.com and then dns.he.net as your secondary dns service or vice-versa. They will do transfers/updates from each other and work just fine.

cleaver10y ago

I had one customer on DO DNS and it was a "good enough" solution. Unfortunately, this came right in the middle of a marketing push for last-minute registrations. An annoyance, but not a major financial impact. (Maybe it will give the impression of excess demand. :)

I understand that things break and I should be ready for it. What I found unacceptable were the status updates. Basically, "we're working on it". No clue as to what was going on. A DDoS? Not a DDoS? Routing issues? Corrupt zone files? No clue? Any of those would be helpful as I needed to figure out if I should wait it out, or switch to Route 53.

In the information vacuum, I switched to Route 53. It works.

stevekemp10y ago

Route53 is a wonderful service, which has never had a global outage (so far).

I've wrapped it in git, to allow very quick updates to be made in a simple fashion:

https://dns-api.com/

scurvy10y ago

How do you fail over your SOA on .com when the minimum TTL is 1 day?

sfilipov10y ago

The answer probably is not very helpful but... change the TTL to a lower value? Is it essential to have such a big TTL of 1 day?

scurvy10y ago

Sure, have fun convincing Verisign. They're the ones controlling the .com registry. Your TTL values are meaningless to them when it comes to SOA. Minimum 1 day TTL.

chronid10y ago

DNS is hard. Very hard.

It may seem trivial when it works (hint: it's not), but some of the biggest fuck ups I've seen in my professional life were caused by strange DNS things happening or DNS servers going kaboom.

I feel the pain of the DO engineers trying to mitigate this issue. I really do.

johansch10y ago

BS. DNS is a trivial thing to scale, compared to most other web-scale efforts.

Things break when people don't use 20 year old best practices. There is no defense against inexperience and ignorance.

takeda10y ago

I took the OP's comment as "it's hard to understand DNS and the biggest fuck ups happen because people think they understand DNS when they actually don't".

The problem with DNS is that it can work even when it is configured incorrectly. This makes people who have no idea what they are doing think that they actually understand it. The strange issues with DNS only happen with strange configurations. When you follow best practices everything is predictable.

johansch10y ago

All right. This I can agree with.

thyrsus10y ago

Please help the ignorant and provide a link to a description of those best practices.

dividuum10y ago

> I feel the pain of the DO engineers trying to mitigate this issue. I really do.

Me too. Just last week they had another problem with DNS on the client side of things: Resolving with the Google Public DNS, which most droplets use by default, didn't work reliably. I hope that they post a combined post mortem for both of those incidents.

Thaxll10y ago

It's not hard, the problem is everything relies on DNS so when DNS goes down or has problems you have cascading failure.

bitJericho10y ago

That's why you use multiple providers.

tyingq10y ago

One thing hosting providers could do better would be to split the risk a little by not handing the same dns server name to every client that chooses to have the hosting provider supply dns services.

The reason this might have some upside is that DDOS attacks against a specific DNS server are often intended to target one specific customer of a hosting provider. The attacker doesn't care about the side effects...just the original target.

Say, for example "controversialblog.com" is hosted on DO, and uses DO dns servers. The person attacking "controversialblog.com" looks up the NS records for the domain, and attacks that DNS server. The fact that it's one hostname that serves all of DO is of little interest to the attacker.

So, if DO would come up with say, 10 separate hostnames they could hand out, then this sort of thing would take down 10% of their customers instead of 100%.
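
A minimal sketch of that idea, assuming a provider assigns each customer domain to one of ten nameserver hostnames by hashing the domain. The pool names (`ns1.dns-pool.example` and so on) and the `assign_nameserver` helper are hypothetical, not anything DO actually runs. The blast radius only shrinks if each hostname is backed by separate addresses and capacity; ten names pointing at the same machines would change nothing.

```python
import hashlib

# Hypothetical pool of nameserver hostnames, invented for illustration only.
NS_POOL = [f"ns{i}.dns-pool.example" for i in range(1, 11)]


def assign_nameserver(customer_domain: str, pool=NS_POOL) -> str:
    """Deterministically map a customer domain to one hostname in the pool."""
    # Hash the lower-cased domain so the same customer always lands on the
    # same shard, regardless of how the name was typed.
    digest = hashlib.sha256(customer_domain.lower().encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


if __name__ == "__main__":
    for domain in ("controversialblog.com", "quietshop.example"):
        print(domain, "->", assign_nameserver(domain))
```

With an even hash, an attack on the hostname serving one domain would affect roughly a tenth of the zones instead of all of them.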

traviswingo10y ago

Yeah this is pretty unfortunate. We have some big investor meetings today and this unfortunately took our marketing site offline. Hopefully they resolve this soon - it's the first time we've ever experienced an issue with their service.

We really need fail-overs in place...small team problems.

brndn10y ago

If you are showing a demo or something, you can still navigate to your DO IP address. Of course, I don't know if other things (images, etc.) on your website also rely on their DNS.

Maakuth10y ago

For that kind of emergency, you can point the domain name to the IP address via /etc/hosts (there's a similar file in Windows as well).

tracker110y ago

c:\windows\system32\drivers\etc\hosts
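
For anyone who wants the hosts-file workaround spelled out, here is a small sketch that picks the usual file location per platform and formats the line to add. The helper names are made up for this example, and `203.0.113.10` is a documentation address standing in for the droplet's real IP. Editing the file itself needs root or administrator rights, so the sketch only prints the line.

```python
import os
from pathlib import Path


def hosts_path() -> Path:
    """Return the usual hosts-file location for the current platform."""
    if os.name == "nt":
        root = os.environ.get("SystemRoot", r"C:\Windows")
        return Path(root) / "System32" / "drivers" / "etc" / "hosts"
    return Path("/etc/hosts")


def hosts_entry(ip: str, *names: str) -> str:
    """Format one hosts-file line: the address, then the names it answers for."""
    if not names:
        raise ValueError("at least one hostname is required")
    return f"{ip}\t{' '.join(names)}"


if __name__ == "__main__":
    # Placeholder address; substitute the real IP of the server.
    print(f"Append this line to {hosts_path()}:")
    print(hosts_entry("203.0.113.10", "example.com", "www.example.com"))
```

Remember to remove the line once DNS is back, otherwise the machine keeps using the pinned address even after the site moves.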

defenestration10y ago

We feel the pain as well, as our platform is unreachable. I'm now using another DNS server and changed the nameserver in the domain record. However, the DNS propagation is taking some time. What are you doing at the moment as fail-over?

pbhjpbhj10y ago

Sorry if I'm trying to "teach grandma to suck eggs", but can't you just enter the domain in your local hosts file? If it's a network that needs access then presumably you have some sort of proxy/cache that could be seeded with the necessary domain+IP pairing? I suppose these aren't possible if you're trying to demo on someone else's network or in a public space or such.

traviswingo10y ago

We just switched it over to Route53 and set up some fail-overs there. Took us 5 mins and we're back online.

Looks like DO is still offline so it seems to have been a good call...

sashk10y ago

My rule: a provider should do a single thing:

- Hosting provider - host sites

- vps/cloud provider - provide VMs

- domain registrar - domain related stuff, but not DNS

- dns provider - host dns

- second dns provider - host dns in case first dns provider fails

So many DNS outages recently and all my projects are up.

copperx10y ago

Does Amazon's Route 53 count as a DNS provider, or do you treat it as a hosting provider?

sashk10y ago

For me - neither.

But if I were tied into Amazon's cloud infrastructure, I would have to use many of their features, going against my rules above.

ludbb10y ago

How do you apply your rules considering what's available today? Which services are you using? It sounds like it would be a big headache to orchestrate the automation among all these different providers.

nlivingstone10y ago

Have multiple VMs @ Digital Ocean (TOR1), we use Cloudflare for DNS... All sites have remained available and are successfully fulfilling requests.

cleaver10y ago

Every site where I was using external DNS stayed up.

fredophile10y ago

I don't have anything more important than a small personal website but now I'm curious. If you set up a system to handle your main DNS provider failing, how do you test it? Is there a good reference where I can find some best practices on this?

mrideout10y ago

Here's my testing recommendation:

1. Pick some subset of your DNS records to monitor, or all of them if you want to be extra thorough. If you are picking a subset, then I'd pick whatever records are most critical to your business.

2. Set up monitoring that queries each of your authoritative name servers for each of the records that you identified in the previous step. The monitoring should notify you if any of the name servers are unresponsive, or return a different response than what's expected.
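
The two steps above can be sketched roughly as follows. This is an illustration, not the commenter's service: it assumes the `dig` command-line tool is installed, and the nameserver names, record names and expected values are placeholders. The query function is passed in so the comparison logic can be exercised without touching the network.

```python
import subprocess
from typing import Callable, Dict, List, Set, Tuple

Query = Callable[[str, str, str], Set[str]]


def dig_query(nameserver: str, name: str, rtype: str) -> Set[str]:
    """Ask one specific nameserver directly via `dig`; return its answer lines."""
    result = subprocess.run(
        ["dig", "+short", "+norecurse", "+time=3", "+tries=1",
         f"@{nameserver}", name, rtype],
        capture_output=True, text=True, timeout=10, check=False,
    )
    # Lines starting with ';' are dig diagnostics (e.g. timeouts), not answers.
    return {
        line.strip() for line in result.stdout.splitlines()
        if line.strip() and not line.startswith(";")
    }


def check_records(
    nameservers: List[str],
    expected: Dict[Tuple[str, str], Set[str]],
    query: Query = dig_query,
) -> List[Tuple[str, str, str, str]]:
    """Query every nameserver for every record; return a list of problems found."""
    problems = []
    for ns in nameservers:
        for (name, rtype), want in expected.items():
            try:
                got = query(ns, name, rtype)
            except Exception as exc:  # dig missing, subprocess timeout, ...
                problems.append((ns, name, rtype, f"query failed: {exc}"))
                continue
            if not got:
                problems.append((ns, name, rtype, "no answer"))
            elif got != want:
                problems.append(
                    (ns, name, rtype, f"expected {sorted(want)}, got {sorted(got)}")
                )
    return problems


if __name__ == "__main__":
    # Placeholder servers and records; replace with your own.
    issues = check_records(
        ["ns1.example.com", "ns2.example.net"],
        {("www.example.com", "A"): {"203.0.113.10"}},
    )
    for issue in issues:
        print("ALERT:", *issue)
```

Run it from cron or a monitoring agent hosted somewhere that does not depend on the DNS being checked, and wire the alerts to whatever notification channel you already use.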

If you'd like to dig into the details of DNS, then O'Reilly's "DNS and BIND" is highly recommended, even if you're not using BIND.

There are a number of quality hosting providers out there. A rule of thumb that I use is this: If a DNS hosting provider doesn't eat their own dog food, don't trust them to handle your DNS. Digital Ocean doesn't use their own name servers for their main website's domain. Neither does Amazon.

Shameless plug: I created a DNS monitoring service that can be used for monitoring each of your name servers: https://www.dnscheck.co/

tonyarkles10y ago

The flipside to the dog food point: if DigitalOcean did use their own nameservers for their main site, then we wouldn't have been able to see the status page.

mrideout10y ago

Good point!

Rezo10y ago

Their status page at https://status.digitalocean.com is also now giving an intermittent "500 Internal Server Error" nginx error, probably from the load. That's why you should use a service like https://www.statuspage.io for your important stuff, even though creating a status page is a fun side-project for a dev team.

crisopolis10y ago

So what you're saying is that instead of running their own status page on their own infrastructure that's reachable, they should outsource it to statuspage.io and pay another company to do it?

Rezo10y ago

Yes, that's pretty standard. Availability monitoring and status reporting should be external and separate from your own infrastructure, otherwise neither may be available when you need it the most.

dsr_10y ago

And don't use statuspage.io if your host is AWS, because theirs is too.

clentaminator10y ago

But who monitors the status of statuspage.io?

colinbartlett10y ago

I know you're joking, but I do: Check out https://StatusGator.com. StatusPage.io has a status page at metastatuspage.com which my company monitors.

ludbb10y ago

And who monitors your company monitors? :)

jessaustin10y ago

Are you also on AWS? Apparently you're not on DO...

karlgrz10y ago

This is the first DNS outage I've experienced with them in 3+ years, then again I host everything in their NY regions.

crisopolis10y ago

I've never experienced an outage of any kind with DO, so also first time. I also host all my droplets in the NYC regions.

josh_carterPDX10y ago

Same. They've been pretty reliable. Hoping this doesn't last long or we'll be looking to move off.

crisopolis10y ago

DigitalOcean uses CloudFlare for DNS - https://www.cloudflare.com/case-studies-digital-ocean/

jtokoph10y ago

This statement can be misleading. If you read the article, they don't use CloudFlare's DNS servers per se. They use CloudFlare's DNS proxy which acts as a DNS firewall between the DigitalOcean DNS servers and the world.

josh_carterPDX10y ago

I think the most annoying aspect of this outage is their updates. Three updates and they all say the same thing, with no meaningful information as to what's causing this. Likely they may not have much information, but you'd think there would be something more than what they've been posting for the past hour. Good times!

coreyp_110y ago

Does anyone know of a good strategy for DNS failover?

takeda10y ago

That's one of the easiest things to do. Just add multiple NS records to the domain. As long as they are configured correctly and at least one is up (and for a limited time even if none are up, thanks to caching) the service is available.

With DNS you actually can achieve 100% uptime.
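
To make the built-in failover concrete, here is a simplified simulation of what a resolver does with several NS records: it tries the listed servers and uses the first one that responds. Real resolvers are smarter about server selection (they track response times, for instance), and the `resolve_with_failover` helper and provider names are invented for the example.

```python
import random
from typing import Callable, Iterable, Tuple


def resolve_with_failover(
    nameservers: Iterable[str],
    ask: Callable[[str], str],
    rng=None,
) -> Tuple[str, str]:
    """Try the listed nameservers in random order; return (server, answer)
    from the first one that responds."""
    order = list(nameservers)
    (rng or random).shuffle(order)
    errors = {}
    for ns in order:
        try:
            return ns, ask(ns)
        except OSError as exc:  # includes TimeoutError and ConnectionError
            errors[ns] = exc
    raise RuntimeError(f"all nameservers failed: {errors}")


if __name__ == "__main__":
    def ask(ns: str) -> str:
        # Pretend every server at provider A is down, as in an outage.
        if ns.endswith("provider-a.example"):
            raise TimeoutError("no response")
        return "203.0.113.10"

    servers = [
        "ns1.provider-a.example",
        "ns2.provider-a.example",
        "ns1.provider-b.example",
    ]
    print(resolve_with_failover(servers, ask))
```

The lookup only survives because one of the three names belongs to a different provider; three NS records at the same provider would all have failed together.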

mattzito10y ago

Well, there's a couple of strategies:

- IP-diverse nameservers

- TLD-diverse nameservers

- BGP anycast

IP-diverse nameservers requires that you expect that your DNS servers will go down rather than start returning bad results - I highly recommend having some sort of mechanism to hard-terminate access to those machines.

TLD-diverse nameservers is just an extra strategy for reducing the risk that an upstream TLD issue will blow up your spot.

And then BGP anycast is the expensive, complicated piece of this - it requires a high level of technical sophistication, lots of moving parts, and the QA/validation piece of it is tricky.

When I built an anycast DNS system, we ended up resorting to tricks like having the DNS servers publish routes to the router for redistribution, so that a down or unresponsive server automatically withdrew the routes. Then you do things like TXT records for your zone that respond with which POP you're hitting in some sort of hashed/obfuscated fashion.

It's hard and complicated, and unnecessary for most folks. Better to outsource to Route 53 or someone similar.

blumentopf10y ago

- Implementation-diverse nameservers

Use multiple implementations, e.g. NSD/BIND for authoritative servers and Unbound/BIND for resolvers, to mitigate against implementation-specific bugs and vulnerabilities.

troydavis10y ago

A fair number of providers support zone transfers (AXFR requests) from master to slave name servers. The slaves can be operated by a different entity.

Here's DNSimple's implementation: https://support.dnsimple.com/articles/secondary-dns/

I wrote about moving from 1 to 2+ authoritative DNS providers: http://blog.papertrailapp.com/dns-outage-on-monday-december-.... I think this is just as true today:

> For .. maintainers of mission-critical DNS zones, the solution is to not depend on any single DNS infrastructure for functioning authoritative DNS

c17r10y ago

I don't know if DigitalOcean's DNS servers allow AXFR. If they do, you can use a secondary DNS service to automatically replicate the DNS. You then list the secondary DNS as an NS for your domain.

If they don't allow AXFR -- and after this, they should! -- you can still have a secondary DNS provider but you'd have to duplicate any changes by hand. Not ideal but still doable.
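
Whether the copy is made by AXFR or by hand, it is worth checking that both providers serve the same version of the zone. One common way is to compare SOA serial numbers. The sketch below parses the seven-field line that `dig +short SOA <zone> @<server>` prints and flags servers that lag behind. The server names and serials are placeholders, and taking a plain `max` ignores serial-number wraparound, which is a simplification.

```python
from typing import Dict


def soa_serial(soa_line: str) -> int:
    """Extract the serial from a short-form SOA line:
    MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM."""
    fields = soa_line.split()
    if len(fields) != 7:
        raise ValueError(f"unexpected SOA format: {soa_line!r}")
    return int(fields[2])


def out_of_sync(serials: Dict[str, int]) -> Dict[str, int]:
    """Return the servers whose serial differs from the newest one seen."""
    newest = max(serials.values())
    return {ns: serial for ns, serial in serials.items() if serial != newest}


if __name__ == "__main__":
    # Placeholder answers, as each server might return them.
    answers = {
        "ns1.primary.example": "ns1.primary.example. hostmaster.example.com. 2016032402 10800 3600 604800 3600",
        "ns1.secondary.example": "ns1.primary.example. hostmaster.example.com. 2016032401 10800 3600 604800 3600",
    }
    serials = {ns: soa_serial(line) for ns, line in answers.items()}
    print("stale servers:", out_of_sync(serials))
```

If you maintain the secondary by hand, a stale serial here is the reminder that a change was made on one side only.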

dec0dedab0de10y ago

DNS has failover built into the protocol, just have another server listed at your registrar.

jamescun10y ago

I would be interested in the post-mortem from this. While DigitalOcean operate their own DNS, it is only made publicly available through CloudFlare's DNS proxying service.

NewHatMatt10y ago

From @DOStatus a minute ago:

"Our engineering team has identified the issue, and are working to resolve connectivity issues to our DNS servers.... http://do.co/status"

https://twitter.com/DOStatus/status/713043871559655424

nodesocket10y ago

Recommend AWS Route53 very highly. Route53 also allows you to buy domain names and do lots of fancy fail-over, geolocation, and CNAME alias at the apex magic.

samgranieri10y ago

A few years ago Slicehost had a DNS outage and the webscrapers I had running were falling over because they couldn't resolve DNS. I had to SSH into 8 boxes and update resolv.conf to add Google DNS and OpenDNS as a backup. (Yes, I should've had centralized config management with chef or puppet or ansible.)
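
The resolv.conf change can be scripted instead of typed on each box. A rough sketch, assuming the file is managed by hand (many distributions regenerate it, in which case the change belongs in the tool that writes it): it appends backup `nameserver` lines and stops at three, since the glibc resolver only uses the first three. The helper name is invented. `8.8.8.8` is Google Public DNS and `208.67.222.222` is OpenDNS.

```python
from typing import Iterable

MAX_NAMESERVERS = 3  # glibc's resolver ignores nameserver lines beyond the third


def add_backup_resolvers(resolv_conf: str, backups: Iterable[str],
                         limit: int = MAX_NAMESERVERS) -> str:
    """Return resolv.conf text with backup nameservers appended, up to `limit`."""
    lines = resolv_conf.splitlines()
    existing = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            existing.append(parts[1])
    for ip in backups:
        # Skip duplicates and stop adding once the resolver's limit is reached.
        if ip not in existing and len(existing) < limit:
            lines.append(f"nameserver {ip}")
            existing.append(ip)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    current = "nameserver 10.0.0.2\n"  # placeholder for the provider's resolver
    print(add_backup_resolvers(current, ["8.8.8.8", "208.67.222.222"]), end="")
```

The function only returns the new text; writing it to `/etc/resolv.conf` across a fleet is the part config management would handle.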

crisopolis10y ago

That's crazy... I think by default DO droplets use Google DNS for resolving.

doublerebel10y ago

No offense to anyone here, but what is DO's SLA? Last time I looked, they did not have one.

DO is cheap for a reason. And that's the same reason I don't host with them, I can get SLA-backed infrastructure for a reasonable price and would have no excuse to my customers or cofounders.

bpicolo10y ago

Looks like they do have one: https://www.digitalocean.com/help/policy/

xir7810y ago

We have seamless DNS "failover" by running dnsmasq with the all-servers option on all our servers. It causes dnsmasq to query all upstream servers at once, so if any go down it's transparent to our apps. Works perfectly on our 1500 EC2 instances.
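
The query-everything-at-once behaviour described above can be illustrated with a small race: send the same lookup to every upstream resolver in parallel and take the first successful answer. This is a sketch of the concept, not of dnsmasq's implementation. The `first_answer` helper, the lookup function and the resolver addresses are all invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Sequence, Tuple


def first_answer(
    resolvers: Sequence[str],
    lookup: Callable[[str, str], str],
    name: str,
    timeout: float = 5.0,
) -> Tuple[str, str]:
    """Send the same lookup to every resolver at once; return (resolver, answer)
    for the first one that succeeds."""
    if not resolvers:
        raise ValueError("no resolvers configured")
    errors = {}
    with ThreadPoolExecutor(max_workers=len(resolvers)) as pool:
        futures = {pool.submit(lookup, r, name): r for r in resolvers}
        for fut in as_completed(futures, timeout=timeout):
            try:
                # Note: leaving the `with` block still waits for the slower
                # lookups to finish; a production version would cancel them.
                return futures[fut], fut.result()
            except Exception as exc:
                errors[futures[fut]] = exc
    raise RuntimeError(f"every resolver failed: {errors}")


if __name__ == "__main__":
    def lookup(resolver: str, name: str) -> str:
        # Simulate one dead upstream and one healthy upstream.
        if resolver == "10.0.0.1":
            raise TimeoutError("resolver down")
        return "203.0.113.10"

    print(first_answer(["10.0.0.1", "10.0.0.2"], lookup, "example.com"))
```

The trade-off is extra query volume on every lookup in exchange for not noticing when a single upstream dies.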

r1ch10y ago

I thought their DNS was supposed to be rock solid since they use Cloudflare Virtual DNS. Oh well, lesson learned. Back to running my own DNS servers on each droplet, if the DNS is down the droplet is likely down regardless :).

showerst10y ago

Feeling the pain here too. What DNS providers do others use and like? Route53?

rbritton10y ago

The sites I have that are actually up right now are those routed through CloudFlare.

stevekemp10y ago

Route53 is hard not to love; simple to develop against and very very reliable.

I wrap it in git to make updates more straightforward for people unfamiliar with AWS, but even using it directly is very simple from multiple languages. (https://dns-api.com/)

tyingq10y ago

This has always been a good resource to see who the front runners are: http://www.solvedns.com/dns-comparison/

dboreham10y ago

Bind, running on VMs. Not hard.

SteveNuts10y ago

You run your own authoritative DNS servers?

z9210y ago

I ran my non-authoritative DNS server [bind] on a droplet for about a year. But the server crashed every few months. Why? Never figured out. A restart always fixed it.

Later shifted to DO's DNS servers.

Now that that one is down too, I just shifted back to my domain registrar's DNS.

Everything is working now.

joejoebob10y ago

Where I work we use Route53. For my personal domains I just use my registrar, Namecheap.

dsp123410y ago

For us, Route53 is painful. We host a few thousand zones, and due to rate limiting on APIs, doing something like "Show a list of domain names" or "Give all the domains matching some pattern" was especially bad. Upwards of 30 seconds to do a simple list of domains meant we were forced to cache locally. A local cache, combined with the fact that zone names are not unique in their system (possible to create multiple abc.com entries, which differ only in an internal id and the list of NS entries), made it hard to ensure that our internal systems matched "reality". Then the administrative nightmare of 3-4 different NS entries for each zone means customized, rather than generic, instructions for validating NS settings at the individual registrars.

All in all, it was not a fun experience with such a large volume of zones, but we knew we were an edge case.

hornbaker10y ago

dnsmadeeasy for around 8 years now

yakshaving_jgt10y ago

This is the second time recently their AMS region has gone down, which is where I host my email. What a pain.

satyajeet2310y ago

That awkward moment when it shows the status page

grej10y ago

This is causing huge huge pain for us, Digital Ocean.

colinbartlett10y ago

If you want to get alerted when it comes back up, or you wish you had been alerted when it went down, check out my project: https://StatusGator.com.

StatusGator monitors status pages and sends notifications via email, Slack, and others. You can get alerted to status changes inside Slack and you can ask it the status of a service with a /statuscheck command.

camikazeg10y ago

A bit of feedback: you should have a link back to your dashboard on every page. That seems like the most important page to me as a user, but if I am changing my notification or account settings, there is no way back to that page.

colinbartlett10y ago

Great feedback, thank you! Added that.

pmalynin10y ago

Yeah, tried to access our site and it was down. Really was expecting more out of Digital Ocean than to fuck up such an integral part of their infrastructure. In the future we'll be transitioning away from their DNS solution because this is unacceptable.

tehbeard10y ago

I hope your clients/users are as understanding and civil as you are.

In the meantime, I'm going to wait for the post-mortem before deciding if I should continue using them for DNS. Looking back over the status history, 1-2 incidents a year isn't that bad for my needs, but might be too much for you, which is fine (since I'm only hosting a couple of small side projects with them).

pmalynin10y ago

The problem is, for an early stage startup incidents like this are deadly. Especially since we just applied to a bunch of accelerators.

tonyarkles10y ago

I get that this is a pain in the ass. I've got a significant chunk of infrastructure on DO, I've got work to do today that depends on those machines. I learned about this simultaneously when a deployment failed and I got a text from an engineer at a company I consult with. Not a great way to start the day, for sure.

Know what I'm going to do? I'm going to have a cup of coffee and play with my dogs for a bit. It's inconvenient, it's going to delay things, and I'm a bit choked about it. But it's not worth getting angry over, because there's nothing I can do about it today.

crisopolis10y ago

The reality is that for any app/startup/business everything is a risk, and you didn't take the edge case of "What happens if the primary DNS nameserver for my domain goes down?" into account. Is blaming DO really all you can do?

If your app goes down do you have failover for that? Or do you blame your devops team?

tomschlick10y ago

It's still entirely your fault. Something like Route53/Cloudflare is dirt cheap and crazy redundant. Don't risk your business on free/side services.

paradite10y ago

You can go to your domain registrar and switch to another DNS provider (GoDaddy has their own DNS service).

ju-st10y ago

Be happy that this happened early. Now you know that you should never ever have a single point of failure.

bitJericho10y ago

Just add a second dns provider.

clentaminator10y ago

If your site is so critical that it can't suffer any downtime then why is it not provisioned across multiple independent platforms?

crisopolis10y ago

I also hope the users of your site understand that shit happens. Also as another user said... if DNS is so critical for you then why don't you have proper failover in place?

j / k navigate · click thread line to collapse

121 comments

tonylemesmer10y ago

People hating on DO "I'm losing thousands every hour". Well then should have had some failover in place if its that valuable.

[1]https://twitter.com/rodrigoespinosa/status/71303563702097100...

zackelan10y ago

AWS forum post: https://forums.aws.amazon.com/thread.jspa?threadID=65649&tst...

Previous HN discussion: https://news.ycombinator.com/item?id=2477345

crisopolis10y ago

I've been reading all the comments on Twitter also... like "err mai gawd I'm switching to AWS because of this" and your failure to not have a secondary DNS provider, but I highly doubt you'd switch.

Then another... "Today's @digitalocean DNS #outage is a reminder to not trust your entire business to one provider. Spread the love around!"

If your company is e-commerce and makes money by being 99.99% available. It's your own fault for no fail-over.

another... ".@digitalocean that's two hours without DNS now...my company's websites could be losing thousands of £ in e-commerce! Please, an update!"

codegeek10y ago

crisopolis10y ago

colinbartlett10y ago

I can't disagree with what you're saying, but I think we are all guilty of this. We expect more out of big name services than might be reasonable. (100% uptime)

tracker110y ago

That's because people like to complain... the reality is stuff happens, systems go down, and life tends to go on.

IgorPartola10y ago

On the plus side, in the past 4 years that I've used them this was the very first issue, and they fixed it within a couple of hours.

Anyone have any good recommendations on cheap or free backup DNS hosting?

nullrouted10y ago

Use www.dnsmadeeasy.com and then dns.he.net as your secondary dns service or vice-versa. They will do transfers/updates from each other and work just fine.

cleaver10y ago

In the information vacuum, I switched to Route 53. It works.

stevekemp10y ago

Route53 is a wonderful service, which has never had a global outage (so far).

I've wrapped it in git, to allow very quick updates to be made in a simple fashion:

https://dns-api.com/

scurvy10y ago

How do you fail over your SOA on .com when the minimum TTL is 1 day?

sfilipov10y ago

The answer probably is not very helpful but... change the TTL to a lower value? Is it essential to have such a big TTL of 1 day?

scurvy10y ago

Sure, have fun convincing Verisign. They're the ones controlling the .com registry. Your TTL values are meaningless to them when it comes to SOA. Minimum 1 day TTL.

1 more reply

chronid10y ago

DNS is hard. Very hard.

It may seems trivial when it works (hint: it's not), but some of the biggest fuck ups I've seen in my professional life were caused by strange DNS things happening or DNS servers going kaboom.

I feel the pain of the DO engineers trying to mitigate this issue. I really do.

johansch10y ago

BS. DNS is a trivial thing to scale, compared to most other web-scale efforts.

Things break when people don't use 20 year old best practices. There is no defense against inexperience and ignorance.

takeda10y ago

I took the OPs comment as "it's hard to understand DNS and biggest fuck ups happen because people think they understand DNS when they actually don't".

johansch10y ago

All right. This I can agree with.

thyrsus10y ago

Please help the ignorant and provide a link to a description of those best practices.

dividuum10y ago

> I feel the pain of the DO engineers trying to mitigate this issue. I really do.

Thaxll10y ago

It's not hard, the problem is everything relies on DNS so when DNS goes down or has problems you have cascading failure.

bitJericho10y ago

That's why you use multiple providers.

4 more replies

tyingq10y ago

One thing hosting providers could do better would be to split the risk a little by not handing the same dns server name to every client that chooses to have the hosting provider supply dns services.

So, if DO would come up with say, 10 separate hostnames they could hand out, then this sort of thing would take down 10% of their customers instead of 100%.

traviswingo10y ago

We really need fail-overs in place...small team problems.

brndn10y ago

If you are showing a demo or something, you can still navigate to your DO IP address. Of course, I don't know if other things (images, etc.) on your website also rely on their DNS.

Maakuth10y ago

For that kind of emergency, you can point the domain name to the IP address via /etc/hosts (there's a similar file in Windows as well).

tracker110y ago

c:\windows\system32\drivers\etc\hosts

defenestration10y ago

pbhjpbhj10y ago

traviswingo10y ago

We just switched it over to Route53 and set up some fail-overs there. Took us 5 mins and we're back online.

Looks like DO is still offline so it seems to have been a good call...

1 more reply

sashk10y ago

My rule: provider should do single thing:

- Hosting provider - host sites

- vps/cloud provider - provide VMs

- domain registrar - domain related stuff, but not DNS

- dns provider - host dns

- second dns provider - host dns in case first dns provider fails

So many DNS outages recently and all my projects are up.

copperx10y ago

Does Amazon's Route 53 count as a DNS provider, or do you treat it a hosting provider?

sashk10y ago

For me - neither.

But if I'd be tied in into Amazon's cloud infrastructure, I would have to use many of their features going against my rules above.

ludbb10y ago

nlivingstone10y ago

Have multiple VMs @ Digital Ocean (TOR1), we use Cloudflare for DNS... All site have remained available and successfully fulfilling requests.

cleaver10y ago

Every site where I was using external DNS stayed up.

fredophile10y ago

mrideout10y ago

Here's my testing recommendation:

1. Pick some subset of your DNS records to monitor, or all of them if you want to be extra thorough. If you are picking a subset, then I'd pick whatever records are most critical to your business.

If you'd like to dig into the details of DNS, then O'Reilly's "DNS and BIND" is highly recommended, even if you're not using BIND.

Shameless plug: I created a DNS monitoring service that can be used used for monitoring each of your name servers: https://www.dnscheck.co/

tonyarkles10y ago

The flipside to the dog food point: if DigitalOcean did use their own nameservers for their main site, then we wouldn't have been able to see the status page.

mrideout10y ago

Good point!

Rezo10y ago

crisopolis10y ago

So what you're saying is that instead of running their own Status Page on their own infrastructure that's reachable. They should outsource it to statuspage.io and pay another company to do it?

Rezo10y ago

Yes, that's pretty standard. Availability monitoring and status reporting should be external and separate from your own infrastructure, otherwise neither may be available when you need it the most.

dsr_10y ago

And don't use statuspage.io if your host is AWS, because theirs is too.

2 more replies

clentaminator10y ago

But who monitors the status of statuspage.io?

colinbartlett10y ago

I know you're joking, but I do: Check out https://StatusGator.com. StatusPage.io has a status page at metastatuspage.com which my company monitors.

ludbb10y ago

And who monitors your company monitors? :)

jessaustin10y ago

Are you also on AWS? Apparently you're not on DO...

karlgrz10y ago

This is the first DNS outage I've experienced with them in 3+ years, then again I host everything in their NY regions.

crisopolis10y ago

I've never experienced an outage of any kind with DO, so also first time. I also host all my droplets in the NYC regions.

josh_carterPDX10y ago

Same. They've been pretty reliable. Hoping this doesn't last long or we'll be looking to move off.

crisopolis10y ago

DigitalOcean uses CloudFlare for DNS - https://www.cloudflare.com/case-studies-digital-ocean/

jtokoph10y ago

josh_carterPDX10y ago

coreyp_110y ago

Does anyone know of a good strategy for DNS failover?

takeda10y ago

With DNS you actually can achieve 100% uptime.

mattzito10y ago

Well, there's a couple of strategies:

- IP-diverse nameservers

- TLD-diverse nameservers

- BGP anycast

TLD-diverse nameservers is just an extra strategy for reducing the risk that an upstream TLD issue will blow up your spot.

And then BGP anycast is the expensive, complicated piece of this - it requires a high level of technical sophistication, lots of moving parts, and the QA/validation piece of it is tricky.

It's hard and complicated, and unnecessary for most folks. Better to outsource to Route 53 or someone similar.
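
A rough sketch, in Python, of how the first two strategies could be sanity-checked for an existing NS set. Two assumptions that are mine, not the comment's: "IP-diverse" is approximated as "not all in the same /24", and the TLD is taken as the last label of the hostname. Both function names are hypothetical helpers, and the addresses are documentation ranges.

```python
import ipaddress
from typing import Dict, List


def tld_diverse(nameservers: List[str]) -> bool:
    """True if the name server hostnames span more than one TLD."""
    tlds = {ns.rstrip(".").rsplit(".", 1)[-1].lower() for ns in nameservers}
    return len(tlds) > 1


def ip_diverse(addresses: Dict[str, str], prefix: int = 24) -> bool:
    """True if the name server IPv4 addresses fall in more than one /prefix network."""
    networks = {
        ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        for addr in addresses.values()
    }
    return len(networks) > 1


# Example input: hostnames mapped to already-resolved IPv4 addresses.
ns_addresses = {
    "ns1.example.com": "192.0.2.1",
    "ns2.example.net": "198.51.100.1",
}
print(tld_diverse(list(ns_addresses)), ip_diverse(ns_addresses))
```

A shared /24 is only a crude proxy: two different prefixes can still sit behind the same upstream, so this catches the obvious case rather than proving independence.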

blumentopf10y ago

- Implementation-diverse nameservers

Use multiple implementations, e.g. NSD/BIND for authoritative servers and Unbound/BIND for resolvers, to mitigate against implementation-specific bugs and vulnerabilities.

troydavis10y ago

A fair number of providers support zone transfers (AXFR requests) from master to slave name servers. The slaves can be operated by a different entity.

Here's DNSimple's implementation: https://support.dnsimple.com/articles/secondary-dns/

I wrote about moving from 1 to 2+ authoritative DNS providers: http://blog.papertrailapp.com/dns-outage-on-monday-december-.... I think this is just as true today:

> For .. maintainers of mission-critical DNS zones, the solution is to not depend on any single DNS infrastructure for functioning authoritative DNS

c17r10y ago

I don't know if DigitalOcean's DNS servers allow AXFR; if they do, you can use a secondary DNS service to automatically replicate the zone. You then list the secondary DNS as an NS for your domain.

If they don't allow AXFR -- and after this, they should! -- you can still have a secondary DNS provider but you'd have to duplicate any changes by hand. Not ideal but still doable.
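
If changes do have to be duplicated by hand, a small diff between the two providers' record sets helps catch drift. A minimal sketch in Python, assuming you can export each provider's records as (name, type, value) tuples; `zone_diff` and the record shape are hypothetical, not any provider's API.

```python
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str, str]  # (name, type, value)


def normalize(record: Record) -> Record:
    """Lower-case the name, drop its trailing dot, and upper-case the type."""
    name, rtype, value = record
    return (name.rstrip(".").lower(), rtype.upper(), value)


def zone_diff(primary: Iterable[Record], secondary: Iterable[Record]) -> Dict[str, Set[Record]]:
    """Compare two record sets after normalizing names and types."""
    p = {normalize(r) for r in primary}
    s = {normalize(r) for r in secondary}
    return {
        "missing": p - s,  # on the primary but not yet copied to the secondary
        "stale": s - p,    # still on the secondary but gone from the primary
    }
```

Running this after every manual change (or on a schedule) turns "duplicate by hand" into something you can at least verify.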

dec0dedab0de10y ago

DNS has failover built into the protocol, just have another server listed at your registrar.

jamescun10y ago

I would be interested in the post-mortem from this. While DigitalOcean operate their own DNS, it is only made publicly available through CloudFlare's DNS proxying service.

NewHatMatt10y ago

From @DOStatus a minute ago:

"Our engineering team has identified the issue, and are working to resolve connectivity issues to our DNS servers.... http://do.co/status"

https://twitter.com/DOStatus/status/713043871559655424

nodesocket10y ago

Recommend AWS Route53 very highly. Route53 also allows you to buy domain names and do lots of fancy fail-over, geolocation, and CNAME alias at the apex magic.

samgranieri10y ago

crisopolis10y ago

That's crazy... I think by default DO droplets use Google DNS for resolving.

doublerebel10y ago

No offense to anyone here, but what is DO's SLA? Last time I looked, they did not have one.

DO is cheap for a reason. And that's the same reason I don't host with them: I can get SLA-backed infrastructure for a reasonable price and would have no excuse to my customers or cofounders.

bpicolo10y ago

Looks like they do have one: https://www.digitalocean.com/help/policy/

xir7810y ago

r1ch10y ago

showerst10y ago

Feeling the pain here too. What DNS providers do others use and like? Route53?

rbritton10y ago

The sites I have that are actually up right now are those routed through CloudFlare.

stevekemp10y ago

Route53 is hard not to love; simple to develop against and very very reliable.

I wrap it in git to make updates more straightforward for people unfamiliar with AWS, but even using it directly is very simple from multiple languages. (https://dns-api.com/)

tyingq10y ago

This has always been a good resource to see who the front runners are: http://www.solvedns.com/dns-comparison/

dboreham10y ago

Bind, running on VMs. Not hard.

SteveNuts10y ago

You run your own authoritative DNS servers?

z9210y ago

I ran my non-authoritative DNS server [bind] on a droplet for about a year. But the server crashed every few months. Why? Never figured out. A restart always fixed it.

Later shifted to DO's DNS servers.

Now that that one is down too, just shifted back to my domain registrar's DNS.

Everything is working now.

joejoebob10y ago

Where I work we use Route53. For my personal domains I just use my registrar, Namecheap.

dsp123410y ago

All in all, it was not a fun experience with such a large volume of zones, but we knew we were an edge case.

hornbaker10y ago

dnsmadeeasy for around 8 years now

yakshaving_jgt10y ago

This is the second time recently their AMS region has gone down, which is where I host my email. What a pain.

satyajeet2310y ago

That awkward moment when it shows the status page

grej10y ago

This is causing huge huge pain for us, Digital Ocean.

colinbartlett10y ago

If you want to get alerted when it comes back up, or you wish you had been alerted when it went down, check out my project: https://StatusGator.com.

camikazeg10y ago

colinbartlett10y ago

Great feedback, thank you! Added that.

pmalynin10y ago

tehbeard10y ago

I hope your clients/users are as understanding and civil as you are.

pmalynin10y ago

The problem is, for an early stage startup incidents like this are deadly. Especially since we just applied to a bunch of accelerators.

tonyarkles10y ago

crisopolis10y ago

If your app goes down do you have failover for that? Or do you blame your devops team?

tomschlick10y ago

It's still entirely your fault. Something like Route53/Cloudflare is dirt cheap and crazy redundant. Don't risk your business on free/side services.

paradite10y ago

You can go to your domain registrar and switch to another DNS provider (GoDaddy has their own DNS service).

ju-st10y ago

Be happy that this happened early. Now you know that you should never ever have a single point of failure.

bitJericho10y ago

Just add a second dns provider.

clentaminator10y ago

If your site is so critical that it can't suffer any downtime then why is it not provisioned across multiple independent platforms?

crisopolis10y ago

I also hope the users of your site understand that shit happens. Also as another user said... if DNS is so critical for you then why don't you have proper failover in place?
