How Stack Overflow plans to survive the next DNS attack (opens in new tab)

(blog.serverfault.com)

133 pointssamhamilton9y ago40 comments

40 comments

I think I measured Cloudflare's performance and chose it over Google because it was consistently faster. If the stack-stackers are reading, I'd love to hear why they didn't make the list.

Also, it'd be a great public service to publish the results. Even if it's just enabled for a day per year or so the results would probably appreciated by many. And you could always sell your altruism as the need to continually monitor the situation :)

thefarseeker9y ago

Author here. CloudFlare's DNS was measurably, and consistently, faster than almost every provider we tested. So I can concur with you in that regards. It was ultimately faster than Route 53 and Google Cloud (the path we ended up going).

However, when our relationship with CloudFlare ended and we moved to Fastly, one of the reasons we did so was unreliable lag with DNS updates at CloudFlare. Sometimes it would take minutes or hours for DNS changes to be affected in their system, and they could never provide a satisfactory reason why.

samhamiltonOP9y ago

+1 - I am also very interested why they chose to switch out from CloudFlare for both their DNS and their CDN and over to Fastly. Nick Craver did a write up where they specifically mentioned [1] Cloudflare for both their DNS and CDN.

Do you think after the Dyn outage everyones sysadmins are running round adding redundancy, too worried to trust the uptime of their site in the hands of just CloudFlare?

[1] http://nickcraver.com/blog/2016/02/17/stack-overflow-the-arc...

dx0349y ago

The fact that they abandoned Cloudflare only 6 months after this post means that they must've been pretty disappointed. I wonder if this is the same reason as for some of the other pages that went to Fastly from Cloudflare (e.g. Imgur).

1 more reply

deno9y ago

tl;dr: Javascript/Google reCaptcha Paywall.

Source:

https://meta.stackoverflow.com/questions/323537/cloudflare-i...

1 more reply

fstd9y ago

Can't believe they're running on Windows. Yikes.

captncraig9y ago

One major issue with their dns was unpredictable lag while changing records. On most occasions we can change a record and it would propigate immediately. In some rare cases, it could take fifteen minutes, and in a few instances up to eight hours before the change would show up on their nameservers.

Cloudflare was unable to diagnose this or give us any comfort that it would improve.

bks9y ago

Umm, brilliant thank you for this.

I ended up with a Dyn / Route53 configuration. We used libcloud to sync everything together. We also added the exported zone to Cloudflare but did not enable it.

We had actually planned for this, but in no way did we ever come close to your in depth testing. The @ Azure issue - thank you for uncovering this for the rest of us.

tibu9y ago

Can you maybe share how you did the sync between them? Also there are already some tools pulling zone data from Dyn and adding it to Route53? Can you share why did you choose an own sync. (I'm planning to do the same and I'm interested in other's opinions)

captncraig9y ago

We actually wrote a tool to manage this. We define our desired records in a common dsl format, and the tool can interact with various providers to ensure things match the expected state.

We should be open sourcing this rather shortly, so stay tuned.

Sorry, I'm not who you asked, but that is how we are doing it at stack overflow now.

matt40779y ago

The calculation regarding the ideal number of name servers to list needs some empirical data regarding the likelihood of provider and server outages and the client reactions to it, right? Because otherwise 2 would must be the best number, if I'm not mistaken (Chance of hitting the provider that's offline is always 0.5 on the first try, but the second try would be guaranteed to hit the other).

Here's the math for expected number of tries if half of the servers are offline. (It's a hypergeometric distribution but I couldn't find a closed formula)

E(2 server) = 1 * 1/2 + 2 * 1/2 = 1.5

E(4 server) = 1 * 2/4 + 2 * 2/4 * 2/3 + 3 * 2/4 * 1/3 = 1.67

E(8 server) = 1 * 4/8 + 2 * 4/8 * 4/7 + 3 * 4/8 * 3/7 * 4/6 + 4 * 4/8 * 3/7 * 2/6 *4/5 = 1.73

thefarseeker9y ago

Author here. One thing I didn't cover in the post was about how to have Google and AWS honour their SLA's is that you have to use _all 4_ of the nameservers they provide. Because we're not doing that (only using half), we have to balance the chance of an outage versus the impact of an outage.

You are correct in saying that more empirical data could be used here. We might even end up changing our minds. I'm not much of a numbers person but I might pass this onto some of the people in our company who love solving problems like this.

jlgaddis9y ago

It'd be great if more DNS providers supported "slaving" a zone from an existing server. It would make it much easier to keep DNS synchronized across multiple providers.

Hurricane Electric supports this but most of the providers mentioned in this article do not.

thefarseeker9y ago

Author here. I agree. There are built-in mechanisms for doing this - AXFR and IXFR. However, these mechanisms were not really designed with this sort of scale in mind. You have to keep an up to date whitelist of all the servers that can talk to each other, and they would need to talk to each other on a non-anycasted address (otherwise the notify packet would go to just a single anycasted node).

Managing whitelists between multiple 3rd party DNS providers is likely to break frequently as servers move around, are added, removed, etc.

Interestingly, Hurricane Electric would have been one of our top choices if they had a first class API and a commercial SLA. Their ability to support zone transfers is admirable and did not go un-noticed. DNS Made Easy also supports zone transfers.

jlgaddis9y ago

Just as an additional data point for anyone else reading this...

Hurricane Electric supports zone transfers and requires you to only allow AXFR's from a single host -- slave.dns.he.net (IPv4: 216.218.133.2, IPv6: [2001:470:600::2]). NOTIFYs should not be sent to slave.dns.he.net but instead to ns1.he.net.

n.b.: ns1.he.net is not anycasted, but ns[2-5] are. In addition, ns1 does not have an AAAA RR.

We (ISP) currently run our own authoritative name servers in our own facilities but I've been seriously debating adding another provider into the mix so "secondary" service is an important feature to me.

kosinus9y ago

This'd be awesome, and there are very few providers which offer this. When looking into this, I could only find HE and BuddyNS.

Everyone seems to be inventing their own custom API for this, which I guess is the 'modern developer friendly' approach, but it results in a bit of a mess. Example: Caddy's implementation of the Let's Encrypt / ACME dns-01 challenge has all these plugins: https://caddyserver.com/download

We ended up running our own authoritative nameservers, which is not ideal. But at least cloud offerings allow you to spread across regions.

stevekemp9y ago

Sadly most of the bigger providers require you to code to their API - which makes migrations a little more complex - and that goes double if you get locked into using special DNS-records (rather than common types such as A/AAAA/MX/etc).

That's one of the reasons why the DNS hosting I support, which uses git-hooks to trigger updates, only currently pushes the DNS data to Amazon's route53 infrastructure.

At the time of the most recent Dyn outage I looked at allowing users to support multiple back-ends, to abstract away the pain of redundancy, but it seemed there was surprisingly little interest.

jlgaddis9y ago

I'm glad more and more providers are offering APIs these days but the important feature for me is the ability to slave off of my own servers.

We (ISP) run our own authoritative name servers. Ideally, I'd have a single hidden ("stealth") master (maybe two, w/ anycast) and all of the public name servers would simply slave from that one. If you run PowerDNS -- which supports MySQL/PostgreSQL backends, among others -- you can keep everything in a local database and use standard tools (or write your own) to manage it.

(If I was pretty much anywhere besides an ISP, I'd definitely be using a provider with a fully-featured API. I use Route 53 now for my personal domains but I manage the zones by hand in the console since the RRs practically never change.)

2 more replies

ksec9y ago

From my experience EdgeCast and DNSMadeEasy were consistently the fastest DNS. I guess both were dropped because of price when Google DNS and Route53 did the a similar job.

And as other have said, while Cloudflare may not be for everyone, their DNS is possibly the fastest. Not sure why SO decide to drop them.

*Some old Data http://www.dnsperf.com/

I also wonder on the performance of DNSimple. But they dont see to emphasis much on performance.

thefarseeker9y ago

So DNSMadeEasy made it all the way through the barrage of tests. I even wrote a library for their API so that we could integrate it into our DNS software (https://github.com/mhenderson-so/godnsmadeeasy), but at the end of the day their performance in certain regions was not good enough. In some countries they were measurably faster than R53, but in others they were measurably much slower.

EdgeCast were dropped due to pricing, and that there's talk of Verizon selling the EdgeCast services again.

DNSimple didn't make it to performance testing because they only had 5 POPs, as opposed to 20+ of other providers.

CloudFlare's DNS was consistently one of the fastest, you are correct about that. If you read my responses to other comments here, you'll find that we decided not use their DNS service because of some fairly pervasive API issues we had with it.

elktea9y ago

Netflix have a tool for this as well https://github.com/Netflix/denominator

captncraig9y ago

We have a pretty cool tool for managing this at Stack. Changing from a single provider to two can literally be done by changing one line of config. We should be open sourcing it very soon, and a blog post on it sometime thereafter.

skuhn9y ago

Denominator is not actively developed: https://github.com/Netflix/denominator/issues/374

Last commit of substance was in Sept 2015.

majewsky9y ago

I have not looked at this particular commit history, but I want to argue against the notion that no fresh commits must mean that a project is abandoned. Some stuff is just mostly finished after some point.

1 more reply

paulddraper9y ago

Weird. I attended a lecture about it not too long ago.

I wonder what Netflix is doing instead.

Mojah9y ago

I'm currently working on a tool [1] that can help with checking if all your different providers are 'in sync' and responding with the same answers. Setups like these are only to grow more common as people realise a single DNS provider is a SPOF of its own.

Very good analysis of SO and a smart move to roll this out _before_ a new DNS outage!

https://dnsspy.io

SteveNuts9y ago

I've thought about doing something like this, the biggest issue I found is feature parity between DNS providers.

If you could have a unified API that would create the records on multiple providers that would be money, it's just that you'd lose out on some things like Route 53 health checking, etc.

cuu5089y ago

Is there a good writeup somewhere about setting up redundant NS records at the zone apex? Or, more generally, "DNS primer for busy developer" article?

skuhn9y ago

Once you have configured your zone with multiple providers, it's simply a matter of adding NS entries for each provider's authoritative servers to your registrar. The harder part is ensuring that the zones are kept in sync and that you don't rely on features (such as GSLB stuff or ALIAS records) that aren't available with all providers.

It's up to the client resolver to handle failover, so it's not perfect in terms of availability, but better than nothing.

For example:

  $ dig ns amazon.com
  amazon.com.		3599	IN	NS	ns4.p31.dynect.net.
  amazon.com.		3599	IN	NS	ns1.p31.dynect.net.
  amazon.com.		3599	IN	NS	ns3.p31.dynect.net.
  amazon.com.		3599	IN	NS	ns2.p31.dynect.net.
  amazon.com.		3599	IN	NS	pdns1.ultradns.net.
  amazon.com.		3599	IN	NS	pdns6.ultradns.co.uk.

(note that this is also TLD redundant, since there's a .co.uk included)

captncraig9y ago

The tricky part is making sure the apex NS records are consistent across all authoritative nameservers. A surprising number of dns providers do not let you edit those.

1 more reply

vaara9y ago

I wonder why there's no consideration of anycast servers.

thefarseeker9y ago

Author here. Google and AWS are anycasted services. We were not interested in running our own anycasted DNS due to the management overhead of doing so, and because the cost of anycasting our own services would be orders of magnitude more expensive than outsourcing that to an established DNS provider.

j / k navigate · click thread line to collapse

40 comments

matt40779y ago

I think I measured Cloudflare's performance and chose it over Google because it was consistently faster. If the stack-stackers are reading, I'd love to hear why they didn't make the list.

thefarseeker9y ago

samhamiltonOP9y ago

Do you think after the Dyn outage everyones sysadmins are running round adding redundancy, too worried to trust the uptime of their site in the hands of just CloudFlare?

[1] http://nickcraver.com/blog/2016/02/17/stack-overflow-the-arc...

dx0349y ago

1 more reply

deno9y ago

tl;dr: Javascript/Google reCaptcha Paywall.

Source:

https://meta.stackoverflow.com/questions/323537/cloudflare-i...

1 more reply

fstd9y ago

Can't believe they're running on Windows. Yikes.

captncraig9y ago

Cloudflare was unable to diagnose this or give us any comfort that it would improve.

bks9y ago

Umm, brilliant thank you for this.

I ended up with a Dyn / Route53 configuration. We used libcloud to sync everything together. We also added the exported zone to Cloudflare but did not enable it.

We had actually planned for this, but in no way did we ever come close to your in depth testing. The @ Azure issue - thank you for uncovering this for the rest of us.

tibu9y ago

captncraig9y ago

We actually wrote a tool to manage this. We define our desired records in a common dsl format, and the tool can interact with various providers to ensure things match the expected state.

We should be open sourcing this rather shortly, so stay tuned.

Sorry, I'm not who you asked, but that is how we are doing it at stack overflow now.

matt40779y ago

Here's the math for expected number of tries if half of the servers are offline. (It's a hypergeometric distribution but I couldn't find a closed formula)

E(2 server) = 1 * 1/2 + 2 * 1/2 = 1.5

E(4 server) = 1 * 2/4 + 2 * 2/4 * 2/3 + 3 * 2/4 * 1/3 = 1.67

E(8 server) = 1 * 4/8 + 2 * 4/8 * 4/7 + 3 * 4/8 * 3/7 * 4/6 + 4 * 4/8 * 3/7 * 2/6 *4/5 = 1.73

thefarseeker9y ago

jlgaddis9y ago

It'd be great if more DNS providers supported "slaving" a zone from an existing server. It would make it much easier to keep DNS synchronized across multiple providers.

Hurricane Electric supports this but most of the providers mentioned in this article do not.

thefarseeker9y ago

Managing whitelists between multiple 3rd party DNS providers is likely to break frequently as servers move around, are added, removed, etc.

jlgaddis9y ago

Just as an additional data point for anyone else reading this...

n.b.: ns1.he.net is not anycasted, but ns[2-5] are. In addition, ns1 does not have an AAAA RR.

kosinus9y ago

This'd be awesome, and there are very few providers which offer this. When looking into this, I could only find HE and BuddyNS.

We ended up running our own authoritative nameservers, which is not ideal. But at least cloud offerings allow you to spread across regions.

stevekemp9y ago

That's one of the reasons why the DNS hosting I support, which uses git-hooks to trigger updates, only currently pushes the DNS data to Amazon's route53 infrastructure.

At the time of the most recent Dyn outage I looked at allowing users to support multiple back-ends, to abstract away the pain of redundancy, but it seemed there was surprisingly little interest.

jlgaddis9y ago

I'm glad more and more providers are offering APIs these days but the important feature for me is the ability to slave off of my own servers.

2 more replies

ksec9y ago

From my experience EdgeCast and DNSMadeEasy were consistently the fastest DNS. I guess both were dropped because of price when Google DNS and Route53 did the a similar job.

And as other have said, while Cloudflare may not be for everyone, their DNS is possibly the fastest. Not sure why SO decide to drop them.

*Some old Data http://www.dnsperf.com/

I also wonder on the performance of DNSimple. But they dont see to emphasis much on performance.

thefarseeker9y ago

EdgeCast were dropped due to pricing, and that there's talk of Verizon selling the EdgeCast services again.

DNSimple didn't make it to performance testing because they only had 5 POPs, as opposed to 20+ of other providers.

elktea9y ago

Netflix have a tool for this as well https://github.com/Netflix/denominator

captncraig9y ago

skuhn9y ago

Denominator is not actively developed: https://github.com/Netflix/denominator/issues/374

Last commit of substance was in Sept 2015.

majewsky9y ago

1 more reply

paulddraper9y ago

Weird. I attended a lecture about it not too long ago.

I wonder what Netflix is doing instead.

Mojah9y ago

Very good analysis of SO and a smart move to roll this out _before_ a new DNS outage!

https://dnsspy.io

SteveNuts9y ago

I've thought about doing something like this, the biggest issue I found is feature parity between DNS providers.

If you could have a unified API that would create the records on multiple providers that would be money, it's just that you'd lose out on some things like Route 53 health checking, etc.

cuu5089y ago

Is there a good writeup somewhere about setting up redundant NS records at the zone apex? Or, more generally, "DNS primer for busy developer" article?

skuhn9y ago

It's up to the client resolver to handle failover, so it's not perfect in terms of availability, but better than nothing.

For example:

  $ dig ns amazon.com
  amazon.com.		3599	IN	NS	ns4.p31.dynect.net.
  amazon.com.		3599	IN	NS	ns1.p31.dynect.net.
  amazon.com.		3599	IN	NS	ns3.p31.dynect.net.
  amazon.com.		3599	IN	NS	ns2.p31.dynect.net.
  amazon.com.		3599	IN	NS	pdns1.ultradns.net.
  amazon.com.		3599	IN	NS	pdns6.ultradns.co.uk.

(note that this is also TLD redundant, since there's a .co.uk included)

captncraig9y ago

The tricky part is making sure the apex NS records are consistent across all authoritative nameservers. A surprising number of dns providers do not let you edit those.

1 more reply

vaara9y ago

I wonder why there's no consideration of anycast servers.

thefarseeker9y ago

j / k navigate · click thread line to collapse