The only reason why you can claim 9x latency is because you've saturated the worker threads. You should still win on latency even if it were properly bottlenecking on the network, but 9x throughput and 9x latency is completely false as a capacity limit in this test.
The other issue is 100 bytes isn't typical. It's common but almost every user has a varied workload. Deploying FPGA's for the larger cache values ends up being a waste. I designed a new storage system based off of offloading larger cold keys to flash, even.