You’ll have to wait for the follow up post with the CNI plugin for the full self-reproducible benchmark, but on a 16 core EC2 instance with a 10Gbps connection IPtables couldn’t do more than 5Gbps of throughput (TCP!), whereas again XDP was able to do 9.84Gbps on average.
Furthermore, running bidirectional iPerf3 tests in the larger hosts shows us that both ingress and egress throughput increase when we swap out iptables on just the egresss path.
This is all to say, our current assumption is when the CPU is thrashed by iPerf3, the RSS queues, the Linux kernel’s ksoftirqd threads, etc. all at once it destroys performance. XDP is moving some of the work outside the kernel, while at the same time the packet is only processed through the kernel stack half as much as without XDP (only on the path before or after the veth).
It really is all CPU usage in the end as far as I can tell. It’s not like our checksumming approach is any better than what the kernel already does.