Shuffle Sharding: massive and magical fault isolation (opens in new tab)

(awsarchitectureblog.com)

50 pointsmotter12y ago10 comments

10 comments

Great stuff, but the same generally beneficial approach taken too far can run into its own problems.

http://hackingdistributed.com/2014/02/14/chainsets/

To put it simply, seven million is not a big number, and it's the wrong number anyway. The author confused permutations and combinations; the correct number of four-card hands from a deck including jokers is only 316251. For the more common N=3 it's a paltry 24804. If you're doing "pick any N" to choose replica sets for millions of objects (for example) then pretty quickly every node will have a sharding relationship with every other. The probability of a widespread failure wiping out every member of some shard - leading to loss of data or service - approaches one. You're better off constraining the permutations somehow, certainly not all the way down to the bare minimum, but so that the total probability of data/service loss after N failures remains small.

I really hope people actually do the math instead of just cargo-culting an idea with a catchy name.

colmmacc12y ago

Author here! Well that's embarrassing, you're right about the 7 million number. I've updated the post to correct it. We'll be following up in a later blog post on the numbers we use for Route 53.

notacoward12y ago

Awesome, thank you. Those of us who waste^H^H^H^H^Hspend our lives pondering these things need to stick together. Cheers.

1 more reply

bmn12y ago

Isn't the number used later in the article incorrect too?

The article says: "Thus the real impact is constrained to 1/56th of the overall shuffle shards." Shouldn't it be 1/28th? It's 8*7 / 2 since the permutations "shards x, y" and "shards y, x" are the same as far as fault-tolerance is concerned.

bmm6o12y ago

Is anyone in this area looking at balanced block designs? There's a rich area of math concerned with placing elements in sets subject to various constraints (e.g. minimizing cardinality of intersections). It's been a long time since I looked at that stuff, but I don't remember how easy it is to produce solutions for arbitrary N, which you would want to be able to increase as demand increases.

danudey12y ago

Our approach has turned out to work really well, and very simply.

We have N servers in round-robin DNS. When our mobile client starts up, it does a DNS lookup, fetches the entire list of servers, and then picks one to connect to. If that connection fails, it tries another one, etc. until it runs out (which has never happened).

We also ship the client with a pre-populated list of IP addresses (the current server list as of build time) and the client caches the list it gets from DNS whenever it does a lookup. This means that even in the event of a complete DNS failure, even for hours at a time, our clients are still able to connect. This was quite handy when GoDaddy's DNS was inaccessible a year or two ago due to what I recall was a DDoS attack.

A few weeks ago my ISP's DNS servers went down, and since I have the same mobile and DSL provider, I was completely unable to do anything on the internet — except play our game. It was then that I wondered 'why don't more apps do this?' It seems like a simple problem; if you can't do a DNS lookup, assume the previous IP is still valid. Assuming you're using HTTPS, there should be no more exposure from a security perspective unless someone takes control of your IP address and fakes your SSL certificate, at which point you're screwed anyway.

URSpider9412y ago

Our approach has turned out to work really well, and very simply.
We have N servers in round-robin DNS. When our mobile client starts up, it does a DNS lookup, fetches the entire list of servers, and then picks one to connect to. If that connection fails, it tries another one, etc. until it runs out (which has never happened).

The point of the article is that this approach is vulnerable in the case where something about the client request harms the server -- either takes it down or impairs its response. In such a case, a single bad client could rotate successively through the round robin and take out every one of your servers.

The author is proposing a way to minimize the impact of such a bad actor while still providing a form of round-robin failover for well-behaved requests.

mey12y ago

Quick note about client retry, I highly suggest some type of guard on side effecting operations, either rejecting duplicate transaction ids (generated by the client or retrieved from the server before the side effecting operation) or returning the previous result if the transaction id was recently processed.

Obviously being stateful and correctly routing the request across shards becomes harder and can hurt this scale out solution. It also depends on the functionality of the request, for example an update of the same data may not cause any damage, but a double submit of an order could be.

https://en.wikipedia.org/wiki/Idempotence#Computer_science_m...

j / k navigate · click thread line to collapse

10 comments

notacoward12y ago

Great stuff, but the same generally beneficial approach taken too far can run into its own problems.

http://hackingdistributed.com/2014/02/14/chainsets/

I really hope people actually do the math instead of just cargo-culting an idea with a catchy name.

colmmacc12y ago

Author here! Well that's embarrassing, you're right about the 7 million number. I've updated the post to correct it. We'll be following up in a later blog post on the numbers we use for Route 53.

notacoward12y ago

Awesome, thank you. Those of us who waste^H^H^H^H^Hspend our lives pondering these things need to stick together. Cheers.

1 more reply

bmn12y ago

Isn't the number used later in the article incorrect too?

bmm6o12y ago

danudey12y ago

Our approach has turned out to work really well, and very simply.

URSpider9412y ago

The author is proposing a way to minimize the impact of such a bad actor while still providing a form of round-robin failover for well-behaved requests.

mey12y ago

https://en.wikipedia.org/wiki/Idempotence#Computer_science_m...

j / k navigate · click thread line to collapse