GitHub: Recent Load Balancer Problems (RCA)

(github.com)

58 points | wfarr | 14y ago | 9 comments

vimalg2 14y ago

Reminds me of the last GitHub post I read about their load-balancer setup, from their server guy (2009): http://www.anchor.com.au/blog/2009/10/load-balancing-at-gith...

Willy Tarreau (author of HAProxy) sparked a nice discussion in the comments section at the time.

xtacy 14y ago

The post mentions that heartbeats time out when the load spikes momentarily. I have a few questions and would love to hear answers, if it's okay to share :-)

    1. What load spiked?  Is it the network/CPU load?

    2. By spiked (be it network or CPU), do you mean
       the load went all the way to 100%?   Or was it
       some threshold like say 90% of the available
       capacity?

    3. What's the heartbeat time interval?

Thanks, (EDIT: spacing)

jnewland 14y ago

1. IO and CPU load spiked so much that the system was basically unresponsive over SSH. We think it was due to another Xen VM swapping out of control.

3. Was 10 seconds with a 10 second timeout (way too low to run `xm list` in a loaded situation). It's now 90 seconds with a 90 second timeout.
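To make the timeout trade-off concrete, here is a minimal sketch in Python. It is not Heartbeat's implementation or configuration; the function name and the check durations are hypothetical, chosen only to show how a check that is merely slow under load gets counted as a failure when the timeout is short.

```python
def false_failovers(check_durations, timeout):
    """Count health checks that exceed the timeout.

    A check that runs longer than `timeout` seconds is treated as a
    failed heartbeat, even if the node is only slow rather than dead.
    """
    return sum(1 for duration in check_durations if duration > timeout)


# Hypothetical durations (in seconds) of a health check such as
# `xm list`: fast when idle, very slow while the host is thrashing.
durations = [0.5, 0.8, 25.0, 40.0, 1.2]

print(false_failovers(durations, timeout=10))  # 2
print(false_failovers(durations, timeout=90))  # 0
```

The longer timeout avoids spurious failovers in this sketch, at the cost of taking longer to notice a node that really has died.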

ww520 14y ago

They actually have a pretty good HA setup.

One thing to remember is that an HA cluster is for handling node failure (power loss, faulty hardware, faulty software, etc.). It is not for handling capacity-related failure. If the servers are overloaded with too many requests, they will fail regardless of the HA setup. Capacity monitoring and capacity planning are still needed to maintain uptime.

seiji 14y ago

I rarely see an install of Heartbeat/Pacemaker/CRM prevent more downtime than it causes. If you add in DRBD on top, you get an entire suite of false-HA infrastructure.

stock_toaster 14y ago

Just curious, but what have you seen working well?

At a previous gig we used Heartbeat with HAProxy, and it worked pretty well. We would drop connections on cutover, but it was considered 'acceptable' for our purposes at the time. I wanted to try Wackamole with HAProxy, but we never got around to it.

seiji 14y ago

The only IP failover I trust is CARP (http://www.openbsd.org/faq/pf/carp.html) on OpenBSD/FreeBSD. Once set up properly with syncing, you lose no state on a failover (all connection and NAT state is gossiped between cluster nodes sharing an IP address).

The only downside is most services aren't well tested under OpenBSD/FreeBSD these days so you may end up hitting a few edge cases in software designed and tested only under Linux.

ww520 14y ago

Really? I've built a MySQL HA cluster using that setup plus DRBD and it worked beautifully. We've had a couple of unplanned failures in a couple of years, and they all failed over and came up correctly.
