I think the brain is just more sensitive to speech, because inflection and tone are a key part of communication. So even subtle artifacts in the generated voice are really obvious and annoying.
Plus, as another commenter mentioned, books are long. An issue in 1 out of 10,000 words will be enough to break immersion.