The two articles on your epoll command queuing and prefetching make many observations similar to the BP-Wrapper approach [1], so you might find that paper an interesting read. It is used by the Caffeine cache [2, 3], which pairs a concurrent hash map with lossy striped ring buffers for reads and a lossless write buffer to record & replay policy updates. On my M3 Max (14-core), a 16-thread in-process Zipf benchmark achieved 900M reads/s, 585M mixed reads+writes/s, and 40M writes/s (100% hit rate, so policy updates only). Of course, the majority of your cost is in your I/O threads, but there are a few fun ideas nonetheless.
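To make the "lossy reads, record & replay" idea concrete, here is a minimal single-threaded Python sketch (hypothetical names, not Caffeine's actual implementation, which uses lock-free ring buffers in Java): each read is appended to a small per-stripe buffer chosen by key hash, dropped if the stripe is full, and a later drain replays the buffered accesses into the eviction policy under a lock.

```python
class LossyStripedBuffer:
    """Sketch of recording cache reads into small per-stripe ring
    buffers. When a stripe is full the read event is simply dropped
    (lossy): losing a few accesses barely affects the eviction
    policy's accuracy, and it keeps the read path non-blocking.
    Not thread-safe; this only illustrates the record/drain split."""

    def __init__(self, stripes=4, capacity=16):
        self.capacity = capacity
        self.slots = [[None] * capacity for _ in range(stripes)]
        self.write_idx = [0] * stripes  # total events recorded per stripe
        self.read_idx = [0] * stripes   # total events drained per stripe

    def record(self, key):
        """Record one read; returns False if the event was dropped."""
        s = hash(key) % len(self.slots)  # pick a stripe by key hash
        if self.write_idx[s] - self.read_idx[s] >= self.capacity:
            return False  # stripe full: drop the event (lossy)
        self.slots[s][self.write_idx[s] % self.capacity] = key
        self.write_idx[s] += 1
        return True

    def drain(self, policy_update):
        """Replay all buffered reads into the eviction policy."""
        for s in range(len(self.slots)):
            while self.read_idx[s] < self.write_idx[s]:
                policy_update(self.slots[s][self.read_idx[s] % self.capacity])
                self.read_idx[s] += 1
```

Writes would go through a separate bounded, lossless buffer instead, since dropping a write would desynchronize the policy from the map's contents.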
[1] https://dgraph.io/blog/refs/bp_wrapper.pdf
[2] https://highscalability.com/design-of-a-modern-cache/
[3] https://highscalability.com/design-of-a-modern-cachepart-deu...