[1]https://twitter.com/rodrigoespinosa/status/71303563702097100...
AWS forum post: https://forums.aws.amazon.com/thread.jspa?threadID=65649&tst...
Previous HN discussion: https://news.ycombinator.com/item?id=2477345
Then another... "Today's @digitalocean DNS #outage is a reminder to not trust your entire business to one provider. Spread the love around!"
If your company is e-commerce and makes money by being 99.99% available, it's your own fault for having no fail-over.
another... ".@digitalocean that's two hours without DNS now...my company's websites could be losing thousands of £ in e-commerce! Please, an update!"
How many of us here have failover email services in case Gmail goes down? I think many companies would say they'd lose thousands in productivity if Google Apps suffers an outage yet I'd hazard that very few have failover plans.
Yeah, if your building full of employees can't work because the internet is down, and the secondary is also down, then that's kind of crappy, and you may be paying people to twiddle their thumbs... much short of that, it's kind of the cost of doing business...
There are redundancy options for a lot of things... If you're using only a single host provider for your infrastructure, and management scoffs at creating redundant, under-utilized systems... it's not as "mission critical" as people think/say.
On the plus side, in the past 4 years that I've used them this was the very first issue, and they fixed it within a couple of hours.
Anyone have any good recommendations on cheap or free backup DNS hosting?
I understand that things break and I should be ready for it. What I found unacceptable were the status updates. Basically, "we're working on it". No clue as to what was going on. A DDoS? Not a DDoS? Routing issues? Corrupt zone files? No clue? Any of those would be helpful as I needed to figure out if I should wait it out, or switch to Route 53.
In the information vacuum, I switched to Route 53. It works.
I've wrapped it in git, to allow very quick updates to be made in a simple fashion:
It may seem trivial when it works (hint: it's not), but some of the biggest fuck ups I've seen in my professional life were caused by strange DNS things happening or DNS servers going kaboom.
I feel the pain of the DO engineers trying to mitigate this issue. I really do.
Things break when people don't use 20-year-old best practices. There is no defense against inexperience and ignorance.
The problem with DNS is that it can work even when it is configured incorrectly. This makes people who have no idea what they are doing think that they actually understand it. The strange issues with DNS only happen with strange configurations. When you follow best practices everything is predictable.
Me too. Just last week they had another problem with DNS on the client side of things: Resolving with the Google Public DNS, which most droplets use by default, didn't work reliably. I hope that they post a combined post mortem for both of those incidents.
The reason this might have some upside is that DDOS attacks against a specific DNS server are often intended to target one specific customer of a hosting provider. The attacker doesn't care about the side effects...just the original target.
Say, for example "controversialblog.com" is hosted on DO, and uses DO dns servers. The person attacking "controversialblog.com" looks up the NS records for the domain, and attacks that DNS server. The fact that it's one hostname that serves all of DO is of little interest to the attacker.
So, if DO came up with, say, 10 separate hostnames they could hand out, then this sort of thing would take down 10% of their customers instead of 100%.
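A rough sketch of that idea in Python, just to make the arithmetic concrete. The group hostnames and the hash-based assignment are my own assumptions for illustration, not anything DO actually does:

```python
import hashlib

# Hypothetical name server groups; a real provider would pick its own hostnames.
NAMESERVER_GROUPS = [f"ns-group{i}.example.net" for i in range(10)]


def assign_group(domain: str, groups=NAMESERVER_GROUPS) -> str:
    """Deterministically map a customer domain to one name server group."""
    digest = hashlib.sha256(domain.lower().encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(groups)
    return groups[index]


def affected_fraction(target: str, customers: list[str], groups=NAMESERVER_GROUPS) -> float:
    """Fraction of customers that share a name server group with the attacked domain."""
    hit = assign_group(target, groups)
    sharing = sum(1 for c in customers if assign_group(c, groups) == hit)
    return sharing / len(customers)
```

With 10 groups and an even spread, an attack on the group serving one domain should touch roughly a tenth of customers. With a single group it touches everyone, which is the situation described above.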
We really need fail-overs in place...small team problems.
Looks like DO is still offline so it seems to have been a good call...
- Hosting provider - host sites
- vps/cloud provider - provide VMs
- domain registrar - domain related stuff, but not DNS
- dns provider - host dns
- second dns provider - host dns in case first dns provider fails
So many DNS outages recently and all my projects are up.
But if I were tied into Amazon's cloud infrastructure, I would have to use many of their features, going against my rules above.
1. Pick some subset of your DNS records to monitor, or all of them if you want to be extra thorough. If you are picking a subset, then I'd pick whatever records are most critical to your business.
2. Set up monitoring that queries each of your authoritative name servers for each of the records that you identified in the previous step. The monitoring should notify you if any of the name servers is unresponsive or returns a different response than expected.
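The two steps above could be sketched roughly like this in Python. It assumes `dig` is installed and shells out to it. The function names, the shape of the `expected` mapping, and the hostnames in the usage note are all made up for illustration, and the notification part is left out:

```python
import subprocess


def dig_resolver(nameserver, name, rtype):
    """Ask one name server directly via dig; raises if dig fails or times out."""
    result = subprocess.run(
        ["dig", f"@{nameserver}", name, rtype,
         "+short", "+norecurse", "+time=3", "+tries=1"],
        capture_output=True, text=True, timeout=10, check=True,
    )
    lines = (line.strip() for line in result.stdout.splitlines())
    # Drop blank lines and dig's own ";;" comment lines.
    return frozenset(l for l in lines if l and not l.startswith(";"))


def check_records(expected, nameservers, resolve=dig_resolver):
    """Return a list of problems; an empty list means every server agrees.

    expected maps (name, record_type) to the set of values you expect.
    """
    problems = []
    for (name, rtype), want in expected.items():
        for ns in nameservers:
            try:
                got = resolve(ns, name, rtype)
            except Exception as exc:  # unresponsive server, dig failure, timeout
                problems.append(f"{ns}: no answer for {name} {rtype} ({exc})")
                continue
            if got != frozenset(want):
                problems.append(
                    f"{ns}: {name} {rtype} returned {sorted(got)}, "
                    f"expected {sorted(want)}"
                )
    return problems
```

You would run something like `check_records({("www.example.com", "A"): {"192.0.2.10"}}, ["ns1.example.net", "ns2.example.org"])` from cron and alert whenever the list is non-empty. The resolver is passed in as a parameter so the comparison logic can be exercised without touching the network.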
If you'd like to dig into the details of DNS, then O'Reilly's "DNS and BIND" is highly recommended, even if you're not using BIND.
There are a number of quality hosting providers out there. A rule of thumb that I use is this: If a DNS hosting provider doesn't eat their own dog food, don't trust them to handle your DNS. Digital Ocean doesn't use their own name servers for their main website's domain. Neither does Amazon.
Shameless plug: I created a DNS monitoring service that can be used for monitoring each of your name servers: https://www.dnscheck.co/
With DNS you actually can achieve 100% uptime.
- IP-diverse nameservers
- TLD-diverse nameservers
- BGP anycast
IP-diverse nameservers only help if your DNS servers go down rather than start returning bad results, so I highly recommend having some sort of mechanism to hard-terminate access to those machines.
TLD-diverse nameservers is just an extra strategy for reducing the risk that an upstream TLD issue will blow up your spot.
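As a quick sanity check on the first two points, here is a small Python sketch that reports how many TLDs and networks a set of name servers spans. Treating "same /24" as "same network" is a crude assumption of mine; real IP diversity is about separate networks and providers, and the hostnames and addresses in the usage note are placeholders:

```python
import ipaddress


def tlds(nameservers):
    """Set of top-level labels used by the name server hostnames."""
    return {host.rstrip(".").rsplit(".", 1)[-1].lower() for host in nameservers}


def ipv4_networks(addresses, prefix=24):
    """Set of IPv4 networks (default /24) that the addresses fall into."""
    return {
        ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        for addr in addresses
    }


def diversity_report(ns_to_ip):
    """Summarize TLD and network spread for a {hostname: IPv4 address} mapping."""
    return {
        "tlds": sorted(tlds(ns_to_ip)),
        "networks": sorted(str(n) for n in ipv4_networks(ns_to_ip.values())),
    }
```

Feeding it something like `{"ns1.example.com": "192.0.2.10", "ns2.example.net": "198.51.100.5"}` gives two TLDs and two networks. A report with only one entry in either list means that layer is a single point of failure.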
And then BGP anycast is the expensive, complicated piece of this - it requires a high level of technical sophistication, lots of moving parts, and the QA/validation piece of it is tricky.
When I built an anycast DNS system, we ended up resorting to tricks like having the DNS servers publish routes to the router for redistribution, so that a down or unresponsive server automatically withdrew the routes. Then you do things like TXT records for your zone that respond with which POP you're hitting in some sort of hashed/obfuscated fashion.
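For the TXT record trick, one way it could look (not necessarily how that system did it) is a truncated HMAC of the POP name, which each POP serves as its TXT value. The POP names, secret, and token length below are hypothetical:

```python
import hashlib
import hmac


def pop_token(pop_name: str, secret: bytes, length: int = 12) -> str:
    """Opaque token a POP can publish in a TXT record instead of its name."""
    digest = hmac.new(secret, pop_name.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def identify_pop(token: str, pops, secret: bytes):
    """Map a token seen in a TXT answer back to a POP name, or None if unknown."""
    for pop in pops:
        if hmac.compare_digest(pop_token(pop, secret, len(token)), token):
            return pop
    return None
```

Anyone holding the secret can query the TXT record from different vantage points and see which POP anycast routing sent them to. Outsiders just see a short hex string.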
It's hard and complicated, and unnecessary for most folks. Better to outsource to Route 53 or someone similar.
Use multiple implementations, e.g. NSD/BIND for authoritative servers and Unbound/BIND for resolvers, to mitigate against implementation-specific bugs and vulnerabilities.
Here's DNSimple's implementation: https://support.dnsimple.com/articles/secondary-dns/
I wrote about moving from 1 to 2+ authoritative DNS providers: http://blog.papertrailapp.com/dns-outage-on-monday-december-.... I think this is just as true today:
> For .. maintainers of mission-critical DNS zones, the solution is to not depend on any single DNS infrastructure for functioning authoritative DNS
If they don't allow AXFR -- and after this, they should! -- you can still have a secondary DNS provider but you'd have to duplicate any changes by hand. Not ideal but still doable.
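If you do end up duplicating changes by hand, a small diff helps catch drift between the two providers. A minimal Python sketch, assuming you can export each zone as a list of (name, type, value) tuples; how you get those exports depends on the provider and isn't shown:

```python
def normalize(record):
    """Lowercase the name, drop its trailing dot, and uppercase the type."""
    name, rtype, value = record
    return (name.rstrip(".").lower(), rtype.upper(), value)


def zone_diff(primary, secondary):
    """Compare two zones given as iterables of (name, type, value) tuples."""
    p = {normalize(r) for r in primary}
    s = {normalize(r) for r in secondary}
    return {
        "add_to_secondary": sorted(p - s),
        "remove_from_secondary": sorted(s - p),
    }
```

Both lists coming back empty means the secondary matches the primary. Record values are compared as-is, so differences in formatting (for example a trailing dot on a CNAME target) will show up as changes.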
"Our engineering team has identified the issue, and are working to resolve connectivity issues to our DNS servers.... http://do.co/status"
DO is cheap for a reason. And that's the same reason I don't host with them: I can get SLA-backed infrastructure for a reasonable price, and would have no excuse to give my customers or cofounders.
I wrap it in git to make updates more straightforward for people unfamiliar with AWS, but even using it directly is very simple from multiple languages. (https://dns-api.com/)
Later shifted to DO's DNS servers.
Now that that one is down too, just shifted back to domain registrar's DNS.
Everything is working now.
All in all, it was not a fun experience with such a large volume of zones, but we knew we were an edge case.
StatusGator monitors status pages and sends notifications via email, Slack, and others. You can get alerted to status changes inside Slack and you can ask it the status of a service with a /statuscheck command.
In the meantime, I'm going to wait for the post-mortem before deciding if I should continue using them for DNS. Looking back over the status history, 1-2 incidents a year isn't that bad for my needs, but might be too much for you, which is fine (since I'm only hosting a couple of small side projects with them).
Know what I'm going to do? I'm going to have a cup of coffee and play with my dogs for a bit. It's inconvenient, it's going to delay things, and I'm a bit choked about it. But it's not worth getting angry over, because there's nothing I can do about it today.
If your app goes down do you have failover for that? Or do you blame your devops team?