If we add nodes to an existing Kafka cluster, those nodes own no partitions and therefore send/receive no traffic. A rebalancing event must occur for these servers to become active. Bouncing Kafka on one of the active nodes is one way to trigger such an event.
Fortunately, cluster resizing is infrequent. Unfortunately, network interruptions are not (at least on EC2).
When ZooKeeper detects a node failure (however brief), the node is removed from the active pool and the partitions are rebalanced. This is desirable. But when the node comes back online, no rebalancing takes place. The server remains inactive (as if it were a new node) until we trigger a rebalancing event.
As a result, we have to bounce Kafka on an active server every few weeks in response to network blips. Kafka 0.8 reportedly handles this better, but we'll see.
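For what it's worth, spotting a node stuck in this state is easy to script. Here's a rough Python sketch, assuming you've already pulled the set of live broker ids and a partition-to-owner mapping from wherever your cluster keeps them (that part isn't shown). The function name and the sample data are made up for illustration; none of this is a Kafka API.

```python
def idle_brokers(live_brokers, partition_owners):
    """Return live brokers that currently own no partitions, sorted by id.

    live_brokers: iterable of broker ids that are registered as alive.
    partition_owners: dict mapping (topic, partition) -> owning broker id.
    Both inputs are hypothetical snapshots you would have to collect yourself.
    """
    owners = set(partition_owners.values())
    return sorted(b for b in live_brokers if b not in owners)


# Hypothetical snapshot: broker 3 dropped off briefly, its partitions were
# handed to brokers 1 and 2, and then it rejoined without getting any back.
live = {1, 2, 3}
owners = {
    ("events", 0): 1,
    ("events", 1): 2,
    ("events", 2): 1,
    ("events", 3): 2,
}

print(idle_brokers(live, owners))  # [3]
```

A non-empty result would be the cue that a rebalancing event is needed, rather than waiting to notice a quiet server.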
Handle-jiggling aside, I'm a fan of Kafka and the types of systems you can build around it. Happy to put you in touch with our Kafka guy; just email me (mike.babineau@rumblegames.com). Loggly's also running Kafka on AWS, so it would be interesting to hear their take on this.