Willy Tarreau (author of HAProxy) sparked a nice discussion in the comments section at the time.
1. What load spiked? Is it the network/CPU load?
2. By spiked (be it network or CPU), do you mean the load went all the way to 100%? Or was it some threshold, say 90% of the available capacity?
3. What's the heartbeat time interval?
Thanks,
(EDIT: spacing)
2. Was 10 seconds with a 10-second timeout (way too low to run `xm list` in a loaded situation). It's now 90 seconds with a 90-second timeout.
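For anyone wondering what a check with a timeout looks like in practice, here is a rough Python sketch. It is not the script from the setup described above, just an illustration of the idea: run a command and count it as failed if it exits non-zero or does not finish in time. The name `check_ok` is made up for this example, and the demo uses the Python interpreter as a stand-in command so it runs on machines without Xen.

```python
import subprocess
import sys


def check_ok(cmd, timeout_s):
    """Return True only if cmd exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return False
    except OSError:
        # Command is missing or not executable.
        return False
    return result.returncode == 0


# Stand-in commands (assumption): one that returns at once, one that is too slow.
fast = check_ok([sys.executable, "-c", "pass"], timeout_s=5)
slow = check_ok([sys.executable, "-c", "import time; time.sleep(3)"], timeout_s=0.5)
print(fast, slow)
```

On a real host the call would be something like `check_ok(["xm", "list"], timeout_s=90)`. The point of the longer timeout is that a slow answer from a busy box is not the same thing as a dead node, so the timeout has to be generous enough to cover the loaded case.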
One thing to remember is that an HA cluster is for handling node failure (power loss, faulty hardware, faulty software, etc.). It is not for handling capacity-related failure. If the servers are overloaded with too many requests, they will fail regardless of the HA setup. Capacity monitoring and capacity planning are still needed to maintain uptime.
At a previous gig we used Heartbeat with HAProxy, and it worked pretty well. We would drop connections on cutover, but that was considered 'acceptable' for our purposes at the time. I wanted to try Wackamole with HAProxy, but we never got around to it.
The only downside is that most services aren't well tested under OpenBSD/FreeBSD these days, so you may end up hitting a few edge cases in software designed and tested only under Linux.