These extant solutions require an investment of effort from the user to work up to fast speeds, but once the user is acclimatized they work great. The neuroplasticity of the human brain seems to do a great job of smoothing out the wrinkles.
I've experimented with Tacotron and ESPnet to do the TTS on my own computer, and they work alright, though you sometimes hit edge cases where the output makes some pretty weird sounds. Even if your laptop doesn't have a GPU, Google Colab works well for quick audiobook generation. I don't hit the million characters that often, so it hasn't been a big deal, but I'll probably move to a home-made setup just because I like tweaking it.
The way I think about it is that the written word doesn't have much intonation anyway, so as long as the audiobook doesn't offend me, it's a pretty good solution (and it helps prevent eye strain after working on a computer all day).