You use DNS failover and multiple load balancers.
FOO.COM A record -> 1.2.3.4, 1.2.3.5, 1.2.3.6
Then at 1.2.3.4, 1.2.3.5, and 1.2.3.6 you put a load balancer that splits the load across all of your backend servers.
If any LB goes down, client retries against the other DNS records will deal with it. If any backend server goes down, your LB will deal with it.
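A minimal sketch of the client-side behavior this setup relies on, assuming the client resolves every A record and tries each address in order with a fixed per-address connect timeout. The function names and the 3-second timeout are hypothetical; real browsers and HTTP libraries use their own timeouts and ordering rules, which is exactly what the replies below argue about.

```python
import socket


def resolve_ipv4(host: str, port: int) -> list[str]:
    """Return every IPv4 address for host, in the order the resolver returned them."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses: list[str] = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        if sockaddr[0] not in addresses:  # getaddrinfo can repeat entries
            addresses.append(sockaddr[0])
    return addresses


def tcp_connect(addr: str, port: int, timeout: float) -> socket.socket:
    return socket.create_connection((addr, port), timeout=timeout)


def connect_first_available(addresses, port, timeout=3.0, connect=tcp_connect):
    """Try each address top to bottom; return (address, connection) for the first that works."""
    errors = []
    for addr in addresses:
        try:
            return addr, connect(addr, port, timeout)
        except OSError as exc:  # refused, unreachable, or timed out: move on to the next record
            errors.append((addr, exc))
    raise ConnectionError(f"every address failed: {errors}")
```

Against a live name this would be used as `addr, sock = connect_first_available(resolve_ipv4("foo.com", 443), 443)`. The `connect` parameter exists only so the retry loop can be exercised without a network.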
Using this pretty successfully at DigitalOcean right now. What is the downside? I guess client DNS retries take a few seconds, but for the rare case of a load balancer dying, that doesn't seem like a deal breaker.
Your browser will also cache the result of the DNS lookup, so if that server goes down it won't do another DNS lookup to find a different host, and your service will be unavailable.
It will also be unavailable for any new customer that gets the "faulty" IP address.
Specifying multiple DNS records will just cause your DNS server to use one of those, usually in a round robin fashion.
I am simply giving a list of servers that can answer a request. Clients know to keep trying until one works. (Which they all do. Try it!)
Among other things, IE7 will pin IPs for 30 minutes, non-browser clients may have serious issues, etc.
In my use case, I don't support IE7 (it won't work at all with my SaaS app), and I only support browser clients.
I have simulated LB failures by killing nginx and watched traffic flow over to the other LB without a big delay (within 30 seconds everyone had moved over).
Fancier IP failover is nice for sure, and would let some more enterprisey people in, but for a lot of apps out there, DNS failover works great. Reading the comments above, I'm surprised how many people don't realize it exists or works so well (for so little effort).
What, regardless of TTL? That's gross.
Have you tested this in all the browsers? According to this ServerFault post[1], it could take minutes for an IP address to be considered "down" in Chromium before it cycles to the next one; Firefox apparently waits 20 seconds[2]. Those posts are dated 2011 but I can't imagine the behavior would've changed a whole lot since then. A user is not going to wait multiple minutes or even 20 seconds for a web page to render - it's effectively down.
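To put rough numbers on that: if a client waits a fixed timeout on each dead address before moving on, and every lookup sees the records in a freshly shuffled order, the expected number of dead addresses tried first is k/(n-k+1) for k dead out of n records. The sketch below checks that by brute-force enumeration of orderings. Treating the timeout as a fixed per-address cost is an assumption, and the 20-second figure is just the Firefox number from the post cited above.

```python
from fractions import Fraction
from itertools import permutations


def expected_dead_before_live(n: int, k: int) -> Fraction:
    """Average count of dead addresses tried before a live one, over all n! orderings."""
    if not 0 <= k < n:
        raise ValueError("need at least one live record")
    records = ["dead"] * k + ["live"] * (n - k)
    orders = list(permutations(range(n)))  # every ordering equally likely
    total = 0
    for order in orders:
        for idx in order:
            if records[idx] == "live":
                break
            total += 1
    return Fraction(total, len(orders))


def expected_delay(n: int, k: int, timeout_s):
    return expected_dead_before_live(n, k) * timeout_s


def worst_case_delay(k: int, timeout_s):
    # Unlucky ordering: every dead address comes before the first live one.
    return k * timeout_s
```

With three records, one dead LB, and a 20-second timeout, `expected_delay(3, 1, 20)` is 20/3, about 6.7 seconds on average, while the third of lookups that get the dead IP first wait the full 20 seconds. A minutes-long Chromium timeout scales the same way.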
IP failover with heartbeat or keepalived seems like a much better solution to me when feasible.
Hacker News takes > 20 seconds to load all the time. You mash reload and go on with your life.
I think people get so hung up on "I must have the most optimal HA setup in the history of the world" that they end up having no HA, or spending thousands upon thousands of dollars to build some elaborate AWS Rube Goldberg device that lets them check off a bunch of HA boxes. I know a lot of people who did that, and their fancy AWS HA contraption totally failed in the real world when the entire US-EAST region went down, because their operation depended on at least one of its availability zones working to stay up. Look how much effort Netflix puts into HA, and how many hours a year they are still totally broken.
For each application, you have many competing desires. You can have an HA website or web application without using IP failover. IP failover is cool, but not without its own problems. Every solution has pros and cons. DNS round robin is not a bad solution for many classes of apps that want dead simple failover.
How? How does the DNS client know that the IP no longer works? Do browsers today have this mechanism?
I'm not a network guy, so perhaps I'm wrong, but it's my understanding that the problem with DNS load balancing is that you can't make the client drop a cached record before its TTL expires.
TTL does not matter here because I am not yanking anything from or adding anything to my DNS record. I am simply saying "Here are 3 servers. Try them in order until you find one that works."
In practice, two things help:
a) Most clients try the records in order from top to bottom.
b) Most DNS servers (including DigitalOcean's) randomize the return order.
So if you do 2 DNS requests, the first might return 1.2.3.4, 1.2.3.5, 1.2.3.6, and the second 1.2.3.5, 1.2.3.6, 1.2.3.4.
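The reordering in that example is classic round robin: each answer is the previous one rotated by one position. A minimal sketch, assuming a server that simply rotates a fixed list on every query. `RoundRobinAnswer` is a hypothetical name, and a server that shuffles randomly (as point b says) would not be this predictable.

```python
from collections import deque


class RoundRobinAnswer:
    """Return the same A records on every query, rotated one step each time."""

    def __init__(self, records):
        self._records = deque(records)

    def answer(self) -> list[str]:
        current = list(self._records)
        self._records.rotate(-1)  # first record moves to the end for the next query
        return current
```

Calling `answer()` twice on `RoundRobinAnswer(["1.2.3.4", "1.2.3.5", "1.2.3.6"])` gives exactly the two answers above, so clients that take the top entry alternate between load balancers.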
Together, those two behaviors have the double benefit of splitting traffic more or less evenly between my load balancers and of handling the case where one or more of them is dead.
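A quick simulation of that claim, assuming each client gets a uniformly shuffled answer (point b) and takes the first live address in it (point a). The `simulate` helper, the client count, and the seed are arbitrary choices for illustration; real resolvers and caches won't be this uniform.

```python
import random
from collections import Counter

UNSERVED = "<unserved>"  # bucket for clients that found no live LB


def simulate(records, dead, clients, seed=0):
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(clients):
        order = list(records)
        rng.shuffle(order)  # (b) server randomizes the return order
        for addr in order:  # (a) client tries from top to bottom
            if addr not in dead:
                hits[addr] += 1
                break
        else:
            hits[UNSERVED] += 1
    return hits
```

With nothing dead, each of the three LBs gets roughly a third of clients. Mark 1.2.3.4 as dead and its share splits about evenly between the other two, since the dead address is skipped wherever it lands in the shuffled order.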
You can simulate IP failover with something like Elastic Network Interfaces / Elastic IPs in AWS... it's just not going to be on the same level of speed as doing it in, say, your own rack in a datacenter. It's also subject to weirdness like split brain, where nodes keep trying to take over the interfaces in a loop. The health-checked "multiple load balancers behind a single DNS record" approach has flaws but also simplifies a lot of things.