> I know Bill carried within him deep currents of spiritual yearning that he found easiest to express through the beauty he saw in all places wild.
(I checked in an online sample of the ebook: there is no punctuation in this sentence.)
Unfortunately the AI completely faceplants, placing an enormous pause right in the middle of the phrase "all places wild". It actually changes the meaning of the text, making it sound more like "...the beauty he saw in all places. Wild!"
I wonder if any of these AI speech synthesis tools come with an editing tool that you could use to tell it not to put the pause there.
I don't feel like this is a product for carefully producing audiobooks, but to create them by the pound, so to speak. I'd say it's a move for the "make your own business through audiobooks" people [1] -- very strange for Apple.
[1]: I didn't know this audience existed until I saw this video on it (and the cons that happen) from Dan Olson: https://www.youtube.com/watch?v=biYciU1uiUw
But that's the thing...is "perfection" worth 3 hours of video editing for something you casually consume?
I think almost any audiobook listener will vastly prefer a serviceable but imperfect audiobook when compared with no audiobook at all.
Now human language, limited as it already is, is to be humbled before machines that humans have also invented. In our inability to create a machine capable of doing cognitively what humans do, we prefer that humans function as if they had been lobotomized, in deference to our crude machines.
We have built god, and god is stupid, and we bow before him because god has been created, once again, in the image of man.
I disagree for 2 reasons:
1. There's a perfectly fine reason to put a pause between "places" and "wild": to put emphasis on "wild". Bill doesn't see beauty in all places, but specifically in all wild places.
2. Interpreting the narration as "[...] all places. Wild!" is farfetched because the narrator pronounces "wild" very calmly and softly.
I agree the pause is a bit too long, but I was expecting way worse when I read your comment about how "the AI completely faceplants".
1. A long pause between "places" and "wild", to me, signals their dis-association, that "wild" does not link with "places". However, the lack of punctuation in the written text implies the phrase "all places wild", the "all wild places" you refer to. I'm with the GP here, the AI didn't convey the meaning I'd expect from the text.
2. Also, the preceding text seems to discuss a certain "ineffability", a spiritual/magic in the world that seems diffuse, broad and subtle. With that context, pronouncing "wild" calmly and softly ties it to the earlier ineffability rather than the more discrete "places". Again, this reinforces the "...places. Wild!" interpretation. I am very impressed the AI used two or more modes of expression (time, tone) to express a feeling... but I disagree that's what the text held.
Maybe the AI's smarter than me.
But this excerpt is the end of a paragraph that begins with "Bill loved and found solace in nature." and describes taking walks and looking at the moon. This doesn't support emphasizing "wild" because the author has already established that information; the important part of the sentence is "deep spiritual yearning", or maybe "easiest to express", since the author then goes on to discuss how, after he died, Bill expressed himself from beyond the grave in other ways.
I could kind of understand the AI not quite getting the emphasis right, since that's a judgement call that requires a lot of context from the rest of the book. But breaking up "all places wild" like the sample does suggests that it doesn't understand the basic grammar of the sentence.
I wonder if this is because it's difficult work, or if the tools aren't user friendly enough to put in the hands of untrained users. If it's the latter I suppose that sooner or later we won't need to go through the partners.
A panda walks into a café. He orders a sandwich, eats it, then draws a gun and fires two shots in the air.
"Why?" asks the confused waiter, as the panda makes towards the exit. The panda produces a badly punctuated wildlife manual and tosses it over his shoulder.
"I'm a panda," he says at the door. "Look it up."
The waiter turns to the relevant entry in the manual and, sure enough, finds an explanation.
"Panda. Large black-and-white bear-like mammal, native to China. Eats, shoots & leaves."
https://en.wikipedia.org/wiki/Eats,_Shoots_&_Leaves
But I am also saddened at a future where all this is locked up in corporate hands - obviously there is money needed and (licensed) data needed too which Apple can get at.
Honestly I would rather eschew the ethics of it and just consume any and all voice data (youtube, podcasts, existing audiobooks, radio) that has transcripts available, perhaps because I assume corpos are already doing this, if it means we can have a free and open data model that people can run at home, maybe that makes me evil.
I would guess this rolls out from big companies first because the first version is always the most difficult. It’s only going to get easier to do and I would totally expect end-user controlled TTS systems to get better and eventually exceed the capabilities of this version from Apple. Of course Apple isn’t going to sit still, so they will continue to improve as well.
Are there examples from ten or twenty years ago of a technology that big companies had locked up that never made it out to end users? What we have might lag, but it seems like this stuff only ever gets easier to do.
Will be interesting to see how this develops. Either independent digital narration becomes competitive enough that a publisher simply gets it done once and then sells it on all platforms, or this new platform-exclusive model is so disruptive that it becomes even less economic to produce an Audiobook, effectively making Audiobooks exclusive to Apple and Amazon/Audible (and whoever else has such a digital narration engine).
Even Stable Diffusion was only 600k which is hardly outside the reach of a startup. The only ridiculously expensive models reserved for the big end of town are the GPT3 etc language models, and I fully expect the data/compute requirements to come down considerably in the near future.
https://github.com/coqui-ai/TTS
You can also clone someone's voice by finetuning a pretrained LJSpeech model and training a vocoder from scratch, I've had great success with as little as 15 minutes of speech.
EDIT: "fine voices" not "find voices"
Are you able to point to any articles to help get started with this please?
They will use this technology to save money on human speakers. If they release it into the public domain we'll end up with ebooks that can read themselves aloud and they'll lose part of the incomes from audio books.
My Samsung phone can read ebooks with one of Samsung's voices right now, but it does an awful job at pauses. Basically, no commas. With a good voice I could turn each one of my ebooks into an audiobook.
Edit: to be very specific, a really good voice actor will take on different voices depending on which character is speaking, and will act out scenes realistically. I honestly can't imagine any AI being able to do that.
While there is/was a voice synthesis project at Mozilla it was rudimentary like 3 years ago
My guess is that these generated voices are far from perfect and someone has to go in and crank the algorithm to get a fair number of passages to not sound strange.
Even in the example Helena there is a word at the end of a sentence that sounds like it should be in the middle and has a bit of weirdness to it. Still, very impressive, I think better than I remember Amazon Polly sounding.
This thing by Apple already sounds way better than the best I heard previously (NextUp Ivona) but it is not an instant-result offline tool yet and that's sad.
The trick is that we have "pretty good" results for TTS as-is, but it has significant shortcomings that are more visible in certain use cases. The operative word is "prosody" - the cadence, rhythm, and pauses that are natural when speaking that are heavily dependent on context and content.
Prosody is incredibly important to making natural utterances - TTS models that do not model prosody end up sounding very "flat", which is mostly all of the heavily used TTS engines out there right now. This is less glaring for short responses like what you would get from a voice assistant, but becomes a huge grating problem when you try to do long-form text reading.
The trick with prosody is that it often requires information and context not contained in the text to be read. You would apply a different rhythm and stresses to a horror story than you would to a conference keynote speech, for example. It also requires a more sophisticated understanding of the content of text rather than simply its constituent words, in order to figure out proper stresses and pauses.
All of this is eminently solvable (as demonstrated here with the book voices) but is... rather difficult. I suspect we're not terribly close to a product where you can just feed it raw text (without annotating or otherwise providing additional data as context) and get a great result.
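For what "annotating" the text could look like, here is a minimal sketch using SSML, the W3C markup that many TTS engines accept. In SSML, `<break strength="none"/>` asks the engine not to insert a prosodic break at that point. The `keep_together` helper is hypothetical, and whether Apple's narration pipeline accepts SSML at all is an assumption, not something the announcement says.

```python
from xml.sax.saxutils import escape


def keep_together(sentence: str, phrase: str) -> str:
    """Wrap `sentence` in SSML and suppress pauses between the words of `phrase`.

    Hypothetical helper for illustration; only the first occurrence of
    `phrase` is annotated.
    """
    if phrase not in sentence:
        raise ValueError(f"phrase {phrase!r} not found in sentence")
    before, after = sentence.split(phrase, 1)
    # <break strength="none"/> requests that no prosodic break be produced here.
    joined = ' <break strength="none"/> '.join(escape(w) for w in phrase.split())
    return f"<speak>{escape(before)}{joined}{escape(after)}</speak>"


ssml = keep_together("the beauty he saw in all places wild.", "all places wild")
print(ssml)
```

An engine is free to treat the hint as advisory, so this only nudges the prosody. It does not give the model the context it would need to get the emphasis right on its own.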
The weak link was the available free/open datasets. You needed a single speaker with a pleasant voice, 20hrs+ material from varied sources, recorded in a good recording environment with a good mic etc. For English, the go-to was LJSpeech, which doesn't fulfill all these requirements. I say 'was', as I haven't followed developments recently.
Last year we decided to make our own dataset with an Irish woman, Jenny. She has a soft Irish lilt.
Never got around to training the model, but I will upload the raw audio and prompts here in a few hours (need to pay my internet bill in town..):
https://github.com/dioco-group/jenny-tts-dataset/blob/main/R...
With generative visual art, people usually spend considerable time fine-tuning the results, and we don't get to see all the prompts that didn't work out (except if the failure is notable in some way).
Try e.g. illustrating a book, but using only your first prompt for each image. I think the quality would be in the same ballpark as having Siri narrate the corresponding audiobook.
Stable Diffusion / Midjourney etc look really pretty to the average person but on closer inspection they rarely hold up out of the box. If you’re an experienced artist you pick up on all the flaws right away.
ChatGPT and Copilot are similar. The answers seem confident, but the more familiar you are with the domain of the answer, the quicker you see how flawed the results are.
Now going back to TTS. You’ve spent your whole life knowing what speech sounds like. Unlike those other models that require an extra level of domain knowledge, everyone innately knows the sound of humans speaking. So you’re effectively, and subconsciously, a domain expert.
This is essentially the uncanny valley effect but for other areas.
I think the brain is just more sensitive to speech, because inflection and tone are a key part of communication. So even subtle artifacts in the generated voice are really obvious and annoying.
Plus, as another commenter mentioned, books are long. An issue in 1 out of 10,000 words will be enough to break immersion.
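To put rough numbers on the "1 out of 10,000 words" point, here is a back-of-the-envelope sketch. The 100,000-word book length is my own assumption for a typical novel, and treating glitches as independent per word is a simplification.

```python
def expected_glitches(words: int, one_in: int) -> float:
    """Expected number of glitches if one word in `one_in` is mispronounced."""
    return words / one_in


def p_at_least_one(words: int, one_in: int) -> float:
    """Chance of at least one glitch, assuming each word fails independently."""
    return 1 - (1 - 1 / one_in) ** words


BOOK_WORDS = 100_000  # assumed length of a novel, not a figure from the thread
ONE_IN = 10_000       # the error rate quoted in the comment above

print(expected_glitches(BOOK_WORDS, ONE_IN))
print(p_at_least_one(BOOK_WORDS, ONE_IN))
```

Under those assumptions a listener would hit around ten glitches per book, and a glitch-free listen would be very unlikely.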
Define perfect ;) Two different people will read the same text slightly (or not slightly) differently.
A great example is this brilliant and funny rendition of "To be or not to be" by Tim Minchin, Benedict Cumberbatch, Judi Dench, David Tennant and others. Sorry for the Facebook link, but it's very hard to find this video anywhere: https://www.facebook.com/watch/?v=585252039999241
I think one difference between pictures and audio is that pictures are two-dimensional and we can't take in the whole image at a time. This makes it easy to overlook flaws without careful inspection. And I find that although there has been some amazing AI-generated art, there are still a lot of rough edges and tweaking required to get really clean images.
As far as writing goes, I suspect that the rules of written language are easier to learn and violations easier to overlook than with generated audio.
Yeah, right. What a lame attempt to deflect the (fully warranted) criticism that this will put audiobook narrators out of work.
The fight now seems to be whether this transformation happens only in production, or companies like Apple succeed in breaking the total Audiobook price apart into "license" and "production", only buying the license and having the production done on their proprietary servers.
Overall, I agree it's inevitable that this results in a sharp decline in professionally narrated Audiobooks...
Or, it will increase demand for audiobooks so much that more humans are needed to create top-notch audio.
Until AI is so good that it can mimic emotion, I think there will be a market for human narrators. Of course it will be smaller than what it is now, but I think people will specialize.
There are so many books I have that don’t have an audiobook version because the economics just aren’t there.
This is an easy way that technology can expand human experience.
Even in situations where the author reads the book, I expect it will be cheaper to train an AI to sound like the author than to put the author in a studio for 50 hours (or whatever).
I thought it was a really dumb ruling when Amazon was forced to remove the text to speech function from kindle.
I also think that screen readers are hobbled to avoid this legal issue. I want to send any text through a narrator bot and have it read it to me. There is zero need to compensate anyone other than the developer who writes the AI (and hopefully it will have open source versions donated by developers).
If I’ve bought a book, I should be able to use it as I like.
The current iteration of this technology is not competing with truly great narrators, like Tom Hanks or Jim Dale.
An AI TTS engine at this level would do a far better job of it than that particular dude.
TBH, a human narrator on Audible sometimes just reads the stuff aloud
https://www.consumerreports.org/digital-assistants/apples-ne...
It was killed by publishers who wanted to charge separately for audiobooks.
If Apple has somehow managed to get the licensing for this, I might consider buying from Apple Books in the future.
The Author still needs to hold the rights for Audiobook production, and he needs to license a third party to produce an Audiobook (no matter if human or "digitally" narrated).
I guess that's why this is aimed at "independent Authors", to circumvent negotiating Apple's rev.share and exclusivity for that production with established publishers...
Any sources on this?
See also the recent lawsuit covering the other direction, automatic transcription of Audible books. https://www.geekwire.com/2020/amazon-owned-audible-major-pub...
Author might sell the book rights to company X and audiobook rights to company Y. Company X can't do a text to speech version of their book without infringing on Y. Y can't do speech to text of their version without angering X.
Licenses are fun!
There are certain books that I think I'll always buy the non-AI variant because narrators can bring more than natural reading, they sometimes bring different characters (sometimes more feminine, more baritone, more stereotypical accents) -- and I would melt if AI could do that kind of voice acting.
There is way more good audio content out there than I have the time/interest to listen to and I can't believe I'm that atypical. And a book is a relatively big listening time commitment. I'll happily pay a few dollars more for a good human narrator.
First, the last few years have seen a race to the bottom for narrator rates, since during the pandemic it was recognized that it's a job that can be easily done from home, literally from anywhere in the world.
Accordingly, the up-front cost for an average quality 10 hour book is only about $1,500, and can be turned around in under two weeks from a human. If you get a really good and well known narrator, it's still only about $4,000 (and you'll probably get it quicker).
Also, they're going to be competing against revenue share models from Amazon/Audible, which basically means it costs the author nothing up front. Amazon's bite out of audiobook sales is absurdly high (60%), so other companies definitely could improve on that, and some already are. It's mostly a fight against Audible's brand at this point.
But back to AI: AI narration is going to have to compete against humans willing to do a lot of work for very little pay. I'm honestly not sure the compute and QA costs will be competitive. And frankly, even if it is cheaper, it's not as if those savings will be passed back to the customer.
If you'd like to look at how little it can cost to get a human to do voiceover work, check out fiverr.com and look for voice actors and narrators.
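To make the arithmetic above concrete, here is a rough sketch using the figures quoted in this comment ($1,500 and $4,000 up front for a 10-hour book, a 60% platform cut). The $20 list price is a made-up number for illustration, and applying the 60% cut to a book whose narration was paid up front is my own simplification.

```python
def cost_per_finished_hour(upfront_dollars: int, hours: int) -> float:
    """Up-front narration cost divided by finished audio hours."""
    return upfront_dollars / hours


def break_even_sales(upfront_dollars: int, price_dollars: int, platform_cut_pct: int) -> int:
    """Sales needed to recoup the narration cost after the platform's cut.

    Works in whole cents so the result is exact integer arithmetic.
    """
    cost_cents = upfront_dollars * 100
    net_cents_per_sale = price_dollars * (100 - platform_cut_pct)
    # Ceiling division: a fractional sale still requires one more sale.
    return -(-cost_cents // net_cents_per_sale)


PRICE = 20         # hypothetical list price in dollars, not from the thread
PLATFORM_CUT = 60  # percent, the Amazon/Audible figure quoted above

for label, upfront in [("average narrator", 1500), ("well-known narrator", 4000)]:
    per_hour = cost_per_finished_hour(upfront, 10)
    sales = break_even_sales(upfront, PRICE, PLATFORM_CUT)
    print(f"{label}: ${per_hour:.0f}/finished hour, break-even at {sales} sales")
```

With those inputs the human narration pays for itself after a few hundred sales, which is the bar any compute and QA budget for AI narration would have to beat.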
That doesn't really surprise me. On the flip side, I can get high quality transcriptions for $1/minute (given good audio quality).
People, even those with better than average talent at some things, just often aren't that expensive. I suspect the same is true for some of the generative AI tasks that people are all excited about--new grad English majors are pretty cheap, especially if they can be assisted by search/generative AI.
I hope that the results from this type of production are clearly labeled as computer generated in the store. I don't think putting "AB Apple Books" is clear or sufficient, for someone that doesn't know about this tech "AB" sort of looks like a placeholder for some unnamed human.
(But totally agree with you that this isn't going to replace a good human narrator.)
In times of flat-fee Audiobook platforms the pressure to bring down audiobook production costs will only increase, funding a full-fledged Audiobook production for each book will only become harder to justify.
Moreover, looking at what Apple describes here, they seemingly want to establish digital narration (quality) as a metric for competition between Audiobook marketplaces, not publishers. So if this works out, the major platforms will compete on digital narration and publishers will have less incentive to actually produce an Audiobook with human narrators...
I know nothing about the economics of audiobooks. And will note that there are free public domain audio books already https://librivox.org/. But TTS will improve and, at a minimum, improved TTS will be a benefit for people who can't read for various reasons.
See full script here: https://gist.github.com/ivanistheone/de3ccb244224d101bb93320... and this doc explains how you can setup a keyboard shortcut to turn any text selection into an audio book https://docs.google.com/document/d/1mApa60zJA8rgEm6T6GF0yIem...
Here is a sample if you want to hear what it sounds like: https://minireference.com/static/tmp/constructive_feedback.m...
which is the audio from this blog post https://productivityhub.org/2019/04/19/how-to-deliver-constr...
IMHO, the computer generated voice like Alex (the default voice on macOS) sounds better because it doesn't try to do inflections or add human character when it is reading. The real-world narrators (voice actors) seem to add too much "character" into their reading, which distracts me from the story/content. The only exception is when the narration is done by the author, in which case I'd consider the narration as part of the work.
I'm normally able to follow narrative (both fiction and non-fiction) that has something to teach, and also enjoy listening to classic literature no problem...
But sometimes I'm reading a long article from the internet and I experience what you describe (losing track of what author is saying, having to rewind to get the point). After a while, I realize it's not the computer's fault, but the article is just very low content (e.g. some authors just pile on words, emotions, opinions without a coherent narrative or point). Recently I noticed I'm able to detect GPT-generated text this way too... words without content or message.
Perhaps the monotone TTS can be a test for the "meaning" contents of a text.
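#!/usr/bin/env bash
# Download a video's audio track and convert it to mp3. Requires yt-dlp.
url="${1:?usage: pass the video URL as the first argument}"
echo "Downloading mp3 from $url"
# -x extracts the audio; --audio-format mp3 converts it to mp3.
yt-dlp -x --audio-format mp3 "$url"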
You'll need to install https://github.com/yt-dlp/yt-dlp#installation before you can use that. As you can see, the "script" just adds the options `-x` (extract audio) and `--audio-format mp3` (convert to mp3 at the end).
https://git.herrbischoff.com/awesome-macos-command-line/abou...
Edit: Adding a few more details to my thoughts to say why it's boring. Good narration is so much more than correct pauses. Pacing. Emotion around words like death and life. Ensuring that sentences don't repeatedly end on the same inflection tone. Modulation of rhythm. None of that is there.
The last time I ran into this was when a known person started a youtube channel where they put together the script and the video and then used an AI to narrate the script. I assumed it was an AI because I figured that's how said acquaintance would have managed the budget. But it was incredibly tedious to listen to. You can see this in work here (https://www.youtube.com/watch?v=yWVvmKpCBDg). Has the same feel of the Apple digital narration. I don't know how I could listen to that easily for over an hour.
On iOS/macOS, VoiceDream has offered flexible apps with voices in multiple languages and accents since 2012, e.g. for reading PDFs, web, non-DRM ePub books and scanned text, https://www.voicedream.com/about/.
"Mitchell sounds like Ray Porter" is more accurate. Accent is completely different, so they don't sound exactly alike. My first impression was that Mitchell sounds like Clay Jenkinson,[1] but more a cross between Jenkinson and a male newscaster I can't place who is probably retired now, but who also narrated documentaries.
According to a Reddit comment [1] it is, but they haven’t posted their source.
[0] https://twitter.com/pwnies/status/1610857711008370688?s=46&t...
[1] https://reddit.com/r/apple/comments/103iogu/_/j305eby/?conte...
Correct one is https://twitter.com/Ray__Porter
Was always frustrated that Kindle was barred from reading books aloud; it is such a natural progression of capabilities. Leave it up to the buyer to decide if they want to pay for the person, but default TTS should be allowed for all books, so that if I read a book at home I can then continue listening during a walk.
I have my biases. My wife and I have licenses to listen to about 500 Audible audio books and in the best of them I feel like I have a human to human relationship with the narrator that is similar to a relationship with the author.
I have mostly worked on deep learning projects over the last eight years, so I appreciate the tech as an engineering tool, but I think it is important to view tech as a servant to human experience.
It's like a Lexus vs. a Hyundai.
https://en.m.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buff...
They don't appear to be making any attempt to have the narration use inflections for different characters. This is probably fine for nonfiction books, but for fiction books, it can make it really hard to follow when a narrator does this, at least for me.
Most answers did not age well I would say.
That said, I suspect it’s less of a technology issue and more cultural: the current generation growing up on robot audio will have different expectations to previous generations, the lack of humanness probably isn’t an issue for them, so even if the technology can never exactly replicate the humanness of audiobooks, it may not matter.
(The text-to-speech on TikTok, for example, is often lampooned by people not of the TikTok generation for being disconcerting and annoying, whereas young people seem to have no problem with it… and that voice is much more artificial sounding than these Apple examples).
I don't think publishers will stop hiring professional narrators for the bestsellers (instead of the long tail that has been text-only until recently) for a few years more.
I also remember reading, a couple of years ago, that Apple was working on improving the voices for Siri - resulting in me thinking "surely Apple has more important things to work on to improve Siri". I guess this was what they were actually aiming for.
My friend had a copy of the scripts from the radio series and we used to use it as inspiration - "open the scripts at a random page and our band name will be the first thing that Marvin says ... Zootlewurdle - perfect!"
I've got them on a set of CDs somewhere, but have nothing that will play them...
I would imagine it was the other direction—that this is a way for them to test out their improvements to Siri.
It was a famous author too.
And now this announcement.
Edit: Previous discussion on HN: [1]
[0] https://winteriscoming.net/2022/12/30/brandon-sanderson-blas...
I sure hope that he negotiated a gigantic amount for his data/training set provided to Apple, as this tech sounds like it’s getting advanced enough to obviate a giant chunk of the narration business overnight.
Also, having the progress sync between digital text book and audio version is a great UX improvement!
Because the contracts between Amazon and the publishers do not permit this kind of work.
Like, this is (legally) a pretty open and shut issue. Amazon has a license from the publisher for the book that permits a certain range of activities that are contemplated by the contract: showing short excerpts for marketing for example.
IP licenses are carefully constructed, often because the rights are sold to different parties. You may for example license a book to adapt into a movie, but the contract would likely forbid you from adapting it for a TV show. The publisher may sell those two rights separately to two different parties.
And these contracts likely either specifically forbid constructing an audiobook (automated or otherwise) from the original book, or at least do not contemplate it. That is a clear source of lawsuits where Amazon is likely clearly in the wrong.
Another reply here mentions that the publishers do not "like" that Amazon went ahead and did this without consultation. That may very well be true - but more importantly (and constructively) it's likely that Amazon's behavior is specifically forbidden in the contract they voluntarily signed on to with the publisher.
As a user focussed feature, it could read any audiobook out loud, and would differentiate apple books from any other audiobook platform.
I guess it's aimed at authors, because then they can charge the author for the 'narration' service....
Licensing. Audiobooks are a different license from ebooks, and trying to narrate an audiobook will infringe the licensing terms.
No author is going to win a twitter flame war because they don't want a 'speak it out loud' button provided by apple on their ebooks.
Besides - all ebooks on Apple platforms already support this via:
Settings → General → Accessibility → VoiceOver → ON turns on the VoiceOver feature
This is just a higher quality version of the same.
It seems Apple is trying to get the audiobook license directly from those authors who didn't sign the license away yet, undercutting production cost for the Audiobook with "digital narration" and then earning more money per sale...
I guess we're going to see human narration die very fast now, at least for some common languages, and tech companies want to ensure that they can split license from production cost, instead of being forced to buy the "whole" Audiobook...
Cherry-picking is easy, but I paid for this, it needs to be human quality throughout
Yes, it needs to be tens of hours of perfectly good narration.
I suspect it’s to avoid labeling the speakers as “male” and “female.” What a joke.
I wonder how this works with her ???.
So it's also platform capitalist moat building - i.e. a scheme to deprive Amazon Audible of audiobooks. The more publishers opt to use Apple Books digital narration instead of paying a narrator, the fewer audiobooks will be available on Audible. And yes, you are allowed to still pay a narrator and distribute that recording on Audible, but... if you could do that, then obviously you wouldn't bother with Apple's TTS system.
Of course, the flipside of this is that Amazon refuses to bother with copyright enforcement for books not on Audible. Cory Doctorow found this out the hard way[0]. If you do not license your work to Amazon, Amazon will pay someone else to copy it, and for some reason DMCA 512 protects them[1]. So I can see this winding up being a functionally unused service anyway.
[0] https://www.audible.com/pd/Why-None-of-My-Books-Are-Availabl...
[1] To be clear, I do not oppose DMCA 512; I just don't think DRM-bearing audiobook services that charge money should be allowed to disclaim copyright liability. DMCA 512 and 1201 should be mutually exclusive.