I believe there’s work to minimize this using io_uring so that you can talk to the fuse driver without the kernel being in the middle, but that work isn’t ready last time I checked.
For what it’s worth at Palm we had a similar problem because our applications were stored compressed but exposed through fuse uncompressed, instead of O_DIRECT I just did an fadvise to dump the cache after a read. Not as high throughput but the least risky change to get the same effect.