The trick is that we have "pretty good" results for TTS as-is, but it has significant shortcomings that are more visible in certain use cases. The operative word is "prosody" - the cadence, rhythm, and pauses of natural speech, which are heavily dependent on context and content.
Prosody is incredibly important to making natural utterances - TTS models that do not model prosody end up sounding very "flat", which describes most of the heavily used TTS engines out there right now. This is less glaring for short responses like what you would get from a voice assistant, but becomes a hugely grating problem when you try to do long-form text reading.
The trick with prosody is that it often requires information and context not contained in the text to be read. You would apply different rhythm and stresses to a horror story than you would to a conference keynote speech, for example. It also requires a more sophisticated understanding of the content of the text, rather than simply its constituent words, in order to figure out proper stresses and pauses.
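To make this concrete: the main way to hand that missing context to today's engines is explicit markup, e.g. W3C SSML, where the author spells out pacing, pitch, and pauses by hand. A rough sketch of what annotating a horror-story line might look like (the specific rate/pitch values here are illustrative guesses, not recommendations):

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- Slow, low delivery to set a tense mood -->
  <prosody rate="slow" pitch="low">
    The house had been empty for years.
  </prosody>
  <!-- A deliberate dramatic pause before the reveal -->
  <break time="800ms"/>
  <prosody rate="slow">
    Or so she <emphasis level="strong">thought</emphasis>.
  </break-placeholder-removed></prosody>
</speak>
```

The point is that every one of those tags encodes a judgment a human reader makes automatically from context - which is exactly the information raw text doesn't carry, and why hand-annotation doesn't scale to long-form reading.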
All of this is eminently solvable (as demonstrated here with the book voices) but is... rather difficult. I suspect we're not terribly close to a product where you can just feed it raw text (without annotating it or otherwise providing additional data as context) and get a great result.