In June of this year, my entire team was struggling after moving to a new Kafka cluster because of low throughput in one of our backend services (about 250k/min with approximately 32 instances). This was causing issues for our downstream dependencies, as we were not able to reply within one hour of consuming a record from upstream, and our L2 support was getting numerous pages every day, which were also being escalated to us. My manager then asked me to look into what was going on and whether we could improve the throughput somehow. After spending almost a week banging my head against different theories and running tons of experiments in QA, I finally figured out that the new Kafka client we were using had a setting that required acks from all brokers (which had increased to 5 in the new cluster from 3 in the previous one), and this caused a huge increase in blocking time even though we were using an async framework. The async task just waited too long for completion and, once completed, could not get back onto the thread pool due to competition with other threads. The solution was simple: I changed the Kafka producer ack setting from "ALL" to "ONE", requiring acknowledgement from only one broker. Throughput with the same 32 instances jumped from 250k/min to around 700k/min.
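For anyone who wants to see how small the change was, here is a rough sketch in Python rather than our actual code. The key names `acks` and `bootstrap.servers` are the standard Kafka producer config keys, where "ALL" and "ONE" correspond to the values `all` and `1`; the broker address and the `producer_config` helper are made up for illustration. The arithmetic at the end just restates the throughput numbers quoted above.

```python
# Hypothetical broker address, not taken from the real setup.
BOOTSTRAP_SERVERS = "broker1:9092"


def producer_config(acks: str) -> dict:
    """Build a producer config dict using the standard Kafka key names.

    A dict like this is the shape accepted by clients such as
    confluent_kafka.Producer; no client is created here, so the
    sketch runs without any Kafka library installed.
    """
    if acks not in {"0", "1", "all"}:
        raise ValueError(f"unsupported acks value: {acks!r}")
    return {"bootstrap.servers": BOOTSTRAP_SERVERS, "acks": acks}


before = producer_config("all")  # wait for every in-sync replica
after = producer_config("1")     # wait for the partition leader only

# The single setting that differs between the two configs.
changed = {k: (before[k], after[k]) for k in before if before[k] != after[k]}
print("changed settings:", changed)

# Throughput figures quoted in the post (per minute, 32 instances).
INSTANCES = 32
per_instance_before = 250_000 // INSTANCES
per_instance_after = 700_000 // INSTANCES
speedup = round(700_000 / 250_000, 1)
print("per instance per minute:", per_instance_before, "->", per_instance_after)
print("speedup factor:", speedup)
```

Keep in mind that with `acks=1` the leader acknowledges a write without waiting for the followers, so a record can be lost if the leader fails before replication. Whether that trade-off is acceptable depends on the service.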
I ask the intelligent readers of HN: do you think I should be penalized for this because the change was only one line of config? Yes, that's all it took to resolve this outstanding issue: a one-line config change in a single commit, albeit after a week of thinking and experimenting!