If you want to play with ELBs, rolling deploys, and connection draining to ECS containers, I humbly submit the open-source Convox project I am working on.
https://github.com/convox/rack
It sets up a peer-reviewed, production-tested, batteries-included cluster (VPC, ECS, ASG, ELB, etc.) in minutes.
If the conclusion of this Sysdig post is that you always need to run 2 instances per AZ for the best reliability, I would strongly consider adding that knowledge to the tools, either as a default or as a production check.
Since it sounds like an ELB bug, I'll keep the default of 3 instances in 3 AZs.
Of course the former is very common with Auto Scaling Groups [1] [2]. Then you can use round-robin or session-sticky routing algorithms in the load balancers.
(Apologies if I'm totally off-base for what you were asking.)
1: http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide...
2: http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide...
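To make the ASG-behind-an-ELB setup concrete, here's a minimal sketch of the parameters you'd pass to boto3's `create_auto_scaling_group` call. The group, launch config, ELB, and subnet names are all made up for illustration, and this only builds the parameter dict rather than calling AWS:

```python
# Sketch: attaching an Auto Scaling Group to a classic ELB, as you'd
# pass it to boto3's autoscaling client:
#   boto3.client("autoscaling").create_auto_scaling_group(**asg_params)
# All names (web-asg, web-elb, subnet IDs) are hypothetical.
asg_params = {
    "AutoScalingGroupName": "web-asg",
    "LaunchConfigurationName": "web-launch-config",
    "MinSize": 3,                      # one instance per AZ, per the 3-AZ default above
    "MaxSize": 6,
    "LoadBalancerNames": ["web-elb"],  # the ELB round-robins across registered instances
    "HealthCheckType": "ELB",          # replace instances the ELB marks unhealthy
    "HealthCheckGracePeriod": 300,
    "VPCZoneIdentifier": "subnet-aaaa,subnet-bbbb,subnet-cccc",  # one subnet per AZ
}
```

With `HealthCheckType` set to `ELB`, the group cycles out instances that fail the load balancer's health check rather than only EC2 status checks.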
There's another case that the article doesn't really discuss (though the evidence of it is in the beginning when all connections drop simultaneously) where the ELB nodes themselves scale vertically at a particular threshold. I believe the setup described is still vulnerable to those scaling events.
The other thing to consider when deploying to the cloud with load balancers is to use an immutable architecture. Taking hosts out of service, updating them, and putting them back in service is a bit cumbersome at best and leaves you vulnerable to service outages.
1) If you are at the scale of deploying several times an hour, the instance-hour cost would probably look like a rounding error in your total AWS spend, I'd imagine.
2) At that cadence you'll definitely benefit from using containers and a container scheduler (Kubernetes, ECS, etc.). Reuse the infrastructure but redeploy your apps to your heart's content.
AWS Support indicated that this was a feature of the new NAT Gateways, even though it breaks outbound connections made by popular implementations such as the Requests Python library's urllib3 connection pools. This is pretty unfortunate, and it has been a roadblock in migrating to the NAT Gateways.
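One partial workaround, sketched below (this is my own suggestion, not documented NAT Gateway or Requests behavior): enable TCP keepalives on the pooled sockets so idle connections generate traffic before the gateway's idle timeout fires. The `KeepaliveAdapter` name is made up; the hook it uses (`HTTPAdapter.init_poolmanager` passing `socket_options` through to urllib3) is real:

```python
import socket

import requests
from requests.adapters import HTTPAdapter
from urllib3.connection import HTTPConnection


class KeepaliveAdapter(HTTPAdapter):
    """Transport adapter that enables TCP keepalives on pooled sockets,
    so idle connections keep some traffic flowing instead of being
    silently dropped by an idle-timeout middlebox."""

    def init_poolmanager(self, *args, **kwargs):
        kwargs["socket_options"] = HTTPConnection.default_socket_options + [
            (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
        ]
        super().init_poolmanager(*args, **kwargs)


session = requests.Session()
session.mount("https://", KeepaliveAdapter())
session.mount("http://", KeepaliveAdapter())
```

Note the OS default keepalive interval is often two hours, so on Linux you'd likely also want to add `TCP_KEEPIDLE`/`TCP_KEEPINTVL` options tuned below the gateway's timeout; I left those out of the sketch because they're platform-specific.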
Thanks for the pointer to urllib3 - we'll take a look at it and see if there's anything we can do about the behavior. One of the challenges with sending "FIN" on timeout is, as you write ... it closes the connections cleanly.
Some TCP-based protocols (including even HTTP in some modes) use a successful connection close to indicate that an object has been transferred fully; so what we've seen is that a network connection may stall (internet packet loss for example) ... then the connection eventually times out ... and the "FIN" falsely conveys that the entire object has been transferred. The end result is a truncated object, which is no good either.
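To make that failure mode concrete, here's a toy sketch with plain sockets (not the NAT Gateway itself): an HTTP/1.0-style response with no Content-Length relies on connection close to mark the end of the body, so a FIN injected mid-transfer is indistinguishable from a clean finish:

```python
import socket

# A connected socket pair stands in for client and server.
srv, cli = socket.socketpair()

# HTTP/1.0-style response with no Content-Length: the client only
# knows the body is complete when the connection closes.
srv.sendall(b"HTTP/1.0 200 OK\r\n\r\npartial bo")  # transfer stalls here
srv.close()  # a middlebox-injected FIN looks identical to a clean close

chunks = []
while True:
    data = cli.recv(4096)
    if not data:  # EOF: the client treats this as "transfer complete"
        break
    chunks.append(data)
cli.close()

body = b"".join(chunks).split(b"\r\n\r\n", 1)[1]
# `body` is truncated, but nothing on the wire says so.
```

With a Content-Length header (or chunked encoding) the client could at least detect the truncation; with close-delimited bodies it can't.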
We update existing instances because in our test environment we deploy on every single new commit (we absolutely love that), and we have hundreds of commits (or more) a day. At that pace, replacing instances would be more time-consuming (again, for our specific use case) and less cost-efficient.
Plus, updating existing instances is handled automatically by AWS CodeDeploy, which provides a very good deployment pipeline that you can control using the aws CLI tool.
There are other minor advantages but those are the two main ones.
Does something verify every commit in the testing environment too?
However best practices always evolve...
I'd say that rolling out containers on ECS is starting to really show advantages.
It is now generally:
- easier to build and push an image than to burn an AMI
- faster to boot a container than an instance
- faster to finish a deploy with options like minimum containers in service and a slack instance or two
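The "minimum containers in service" option maps to ECS's deployment configuration. A minimal sketch of the parameters you'd pass to boto3's `update_service` call, with hypothetical cluster/service/task names:

```python
# Sketch: a rolling ECS deploy that keeps containers in service, as
# you'd pass it to boto3's ECS client:
#   boto3.client("ecs").update_service(**deploy_params)
# Cluster, service, and task definition names are made up.
deploy_params = {
    "cluster": "production",
    "service": "web",
    "taskDefinition": "web:42",  # the new revision to roll out
    "deploymentConfiguration": {
        # Keep at least 50% of desiredCount running during the deploy...
        "minimumHealthyPercent": 50,
        # ...and allow up to 200% so new tasks start before old ones stop;
        # this is where a slack instance or two gives the scheduler room.
        "maximumPercent": 200,
    },
}
```

At 100/200 the deploy never dips below the desired count; at 50/100 it works without any spare cluster capacity but halves capacity mid-deploy.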
To be honest most teams don't actually need the extra agility that containers promise.
But if I was starting an AWS setup from scratch I'd strongly consider containers on ECS.
In addition to the speed, containers offer more portability, and a whole new generation of tools is coming in the ecosystem.
But when it doesn't, debugging might actually be simpler with fewer black boxes between you and the metal.
The author mentions Wireshark - fun fact: the founder of Sysdig, Loris, is also a co-creator of Wireshark.