Apple Books digital narration

(authors.apple.com)

300 points | alienreborn | 3y ago | 234 comments

The "Helena" sample contains a pretty good test case of this system's ability to guess where emphasis and pauses go:

> I know Bill carried within him deep currents of spiritual yearning that he found easiest to express through the beauty he saw in all places wild.

(I checked in an online sample of the ebook: there is no punctuation in this sentence.)

Unfortunately the AI completely faceplants, placing an enormous pause right in the middle of the phrase "all places wild". It actually changes the meaning of the text, making it sound more like "...the beauty he saw in all places. Wild!"

I wonder if any of these AI speech synthesis tools come with an editing tool that you could use to tell it not to put the pause there.

cprecioso3y ago

I guess this is more of a case of garbage in - garbage out. The original sentence itself is not well-structured. But people who write like that and don't edit it won't care how the AI reads it.

I don't feel like this is a product for carefully producing audiobooks, but to create them by the pound, so to speak. I'd say it's a move for the "make your own business through audiobooks" people [1] -- very strange for Apple.

[1]: I didn't know this audience existed until I saw this video on it (and the cons that happen) from Dan Olson: https://www.youtube.com/watch?v=biYciU1uiUw

leokennis3y ago

To me this feels like the automated video slideshows that Apple Photos (and undoubtedly Google Photos) makes for you. Perfectly fine, but indeed even on a casual watch you notice mistakes/imperfections you'd never make if you were producing such a slideshow manually.

But that's the thing...is "perfection" worth 3 hours of video editing for something you casually consume?

I think almost any audiobook listener will vastly prefer a serviceable but imperfect audiobook when compared with no audiobook at all.

klondike_klive3y ago

I find those slideshows unintentionally hilarious - ten photos of my kid interspersed with a flash photo of the back of the washing machine and some cable that's hanging down underneath my car, an accidental screen grab of a text message, all treated with the same importance and jaunty library soundtrack.

Wowfunhappy3y ago

> But that's the thing...is "perfection" worth 3 hours of video editing for something you casually consume?

Any book of any length took countless hours to write and edit. Yes, I think it's worth a bit of extra time for a human to go through and read the thing aloud.

If the alternative really is no audiobook at all... okay, I guess something is better than nothing. But on the whole, I'd like publishers to just record more audiobooks, and I'm concerned this technology will result in fewer "real" audiobooks being produced.

paulryanrogers3y ago

Agreed. I bought an old Kindle for traveling because I could use its TTS to listen to ebooks. Now that's a common feature of most reader apps, yet at the time it was rare and even newer Kindles had dropped TTS.

ClassyJacket3y ago

I can understand that sentence fine, I don't see anything wrong with it.

dbspin3y ago

Writer here, it's a terrible sentence.

"I know Bill carried within him deep currents of spiritual yearning that he found easiest to express through the beauty he saw in all places wild."

"Deep currents of spiritual yearning" is both cliched and unspecific / unclear.

"That" is superfluous.

It's unclear how Bill is expressing the beauty he sees, and the sentence structure implies he's somehow responsible for the natural beauty.

"All places wild" is wilfully awkward and anachronistic.

There are countless better ways to say the same thing. For example the tone would be similar and the sentence more concise just to say:

'When Bill spoke of the beauty of nature, I could sense it inspired spiritual feelings in him.'

Or simply: Natural beauty inspired in Bill a yearning for connection to something spiritual.

Neither are great - because the central thought is unclear. Writing is to a large extent the process of expressing a thought or feeling. Clarifying exactly what one wants to say is a central part of writing and editing. The author seems to have failed to clearly define their idea or emotion, so its expression is decorated rather than clarified however you phrase it.

apocalypstyx3y ago

> But people who write like that and don't edit it won't care how the AI reads it.

Now human language, limited as it already is, is to be humbled before machines that humans have also invented. In our inability to create a machine capable of doing cognitively what humans do, we prefer that humans function as if they had been lobotomized, in deference to our crude machines.

We have built god, and god is stupid, and we bow before him because god has been created, once again, in the image of man.

andrepd3y ago

"Our algorithm isn't stupid, it's the literature which is wrong" is not something I expected to read today.

AstixAndBelix3y ago

Have you ever read a written passage out loud and failed because the text was too contrived and the nature of the intonation only became apparent after re-reading the sentence multiple times?

I have, plenty of times. And in most of those times I got really angry at the author for writing "wrong".

So yes, to me saying that the literature is wrong is nothing unexpected.

cprecioso3y ago

It’s more like “the algorithm is stupid, so it can’t fix bad literature in its brain like humans do”

float43y ago

> Unfortunately the AI completely faceplants, placing an enormous pause right in the middle of the phrase "all places wild". It actually changes the meaning of the text, making it sound more like "...the beauty he saw in all places. Wild!"

I disagree for 2 reasons:

1. There's a perfectly fine reason to put a pause between "places" and "wild": to put emphasis on "wild". Bill doesn't see beauty in all places, but specifically in all wild places.

2. Interpreting the narration as "[...] all places. Wild!" is farfetched because the narrator pronounces "wild" very calmly and softly.

I agree the pause is a bit too long, but I was expecting way worse when I read your comment about how "the AI completely faceplants".

the_other3y ago

I disagree with both reasons.

1. A long pause between "places" and "wild", to me, signals their dis-association, that "wild" does not link with "places". However, the lack of punctuation in the written text implies the phrase "all places wild", the "all wild places" you refer to. I'm with the GP here, the AI didn't convey the meaning I'd expect from the text.

2. Also, the preceding text seems to discuss a certain "ineffability", a spiritual/magic in the world that seems diffuse, broad and subtle. With that context, pronouncing "wild" calmly and softly ties it to the earlier ineffability rather than the more discrete "places". Again, this reinforces the "...places. Wild!" interpretation. I am very impressed the AI used two or more modes of expression (time, tone) to express a feeling... but I disagree that's what the text held.

Maybe the AI's smarter than me.

float43y ago

> A long pause between "places" and "wild", to me, signals their dis-association, that "wild" does not link with "places"

But that long pause is still way shorter than the pauses at actual periods! Look at the waveform[0]: the ovals are the periods, the rectangle is the pause between "places" and "wild". I guess due to the length of the pauses at the actual periods, my brain automatically discards the possibility of "all places. Wild!" and then the best interpretation clearly is "all places wild" for me.

But hey, the fact that at least two people interpreted it differently says something. Maybe this was more of a faceplant than I initially realized.

[0] https://imgur.com/a/j9CtZFZ
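The comparison of pause lengths above can be made without eyeballing a waveform image. A minimal sketch, assuming the narration has already been decoded into a list of float samples in the range -1.0 to 1.0; the function name `find_pauses`, the amplitude threshold, and the minimum pause length are all illustrative assumptions, not anything taken from Apple's tooling or from the linked image.

```python
def find_pauses(samples, sample_rate, threshold=0.02, min_pause_s=0.15):
    """Return (start_seconds, duration_seconds) for each quiet stretch.

    A "pause" here is simply a run of samples whose absolute amplitude
    stays below `threshold` for at least `min_pause_s` seconds. This is a
    crude heuristic; real tools usually measure energy over short windows.
    """
    pauses = []
    run_start = None  # index where the current quiet run began, if any

    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            duration = (i - run_start) / sample_rate
            if duration >= min_pause_s:
                pauses.append((run_start / sample_rate, duration))
            run_start = None

    # Handle audio that ends in silence.
    if run_start is not None:
        duration = (len(samples) - run_start) / sample_rate
        if duration >= min_pause_s:
            pauses.append((run_start / sample_rate, duration))

    return pauses


if __name__ == "__main__":
    rate = 100  # toy sample rate, chosen to keep the example small
    # Synthetic signal: speech, a short pause, speech, a long pause, speech.
    signal = [0.5] * 50 + [0.0] * 30 + [0.5] * 50 + [0.0] * 60 + [0.5] * 20
    for start, duration in find_pauses(signal, rate):
        print(f"pause at {start:.2f}s lasting {duration:.2f}s")
```

Sorting the returned durations would show directly whether the gap between "places" and "wild" is closer to a sentence-final pause or to an ordinary word boundary.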

csande173y ago

Maybe this is just me, but I find it really unnatural to pause in the middle of a short phrase like "all places wild". I'd emphasize "wild" by putting stress on it, not by pausing.

But this excerpt is the end of a paragraph that begins with "Bill loved and found solace in nature." and describes taking walks and looking at the moon. This doesn't support emphasizing "wild" because the author has already established that information; the important part of the sentence is "deep spiritual yearning", or maybe "easiest to express", since the author then goes on to discuss how, after he died, Bill expressed himself from beyond the grave in other ways.

I could kind of understand the AI not quite getting the emphasis right, since that's a judgement call that requires a lot of context from the rest of the book. But breaking up "all places wild" like the sample does suggests that it doesn't understand the basic grammar of the sentence.

m_eiman3y ago

I'm guessing that the main reason they require you to go through their "preferred partners" is that their job is to insert annotations that the speech generator needs to make it sound good.

I wonder if this is because it's difficult work, or if the tools aren't user friendly enough to put in the hands of untrained users. If it's the latter I suppose that sooner or later we won't need to go through the partners.
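One standard form such annotations could take is SSML, the W3C markup for speech synthesizers, where `<break strength="none"/>` asks the engine not to insert a prosodic break at that point. The sketch below is only an illustration: nothing in this thread says Apple's pipeline accepts SSML, and the helper name `protect_phrase` is hypothetical.

```python
from xml.sax.saxutils import escape


def protect_phrase(sentence: str, phrase: str) -> str:
    """Wrap `sentence` in SSML and forbid breaks between words of `phrase`.

    Hypothetical helper: it only builds markup. Whether a given TTS engine
    honors strength="none" depends on that engine.
    """
    if phrase not in sentence:
        raise ValueError("phrase does not occur in sentence")

    # Split around the first occurrence of the phrase.
    before, after = sentence.split(phrase, 1)

    # Join the words of the phrase with explicit "no break here" markers.
    glued = ' <break strength="none"/> '.join(
        escape(word) for word in phrase.split()
    )
    return f"<speak>{escape(before)}{glued}{escape(after)}</speak>"


if __name__ == "__main__":
    print(protect_phrase("the beauty he saw in all places wild.",
                         "all places wild"))
```

An editing tool of the kind asked about earlier in the thread would presumably generate markup like this behind a friendlier interface.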

csande173y ago

The sample is taken from a full audiobook that is currently available for sale, so you'd think they would've put an annotation on "all places wild" if they had that ability.

adolph3y ago

Perhaps this bug could be a feature?

  A panda walks into a café. He orders a sandwich, eats it, then draws a gun and fires two shots in the air.

  "Why?" asks the confused waiter, as the panda makes towards the exit. The panda produces a badly punctuated wildlife manual and tosses it over his shoulder.

  "I'm a panda," he says at the door. "Look it up."

  The waiter turns to the relevant entry in the manual and, sure enough, finds an explanation.

  "Panda. Large black-and-white bear-like mammal, native to China. Eats, shoots & leaves."

https://en.wikipedia.org/wiki/Eats,_Shoots_&_Leaves

twobitshifter3y ago

By normal sentence construction it would be all wild places. It’s a good test as you say, the author is having fun with grammar to give you the idea that it’s a subset of places rather than “wild places”, so I would expect it to be written with a link between places and wild, “all places-wild.”

csande173y ago

I've seen this construction used in a lot of places (like the name of "All Things Digital", the predecessor to Re/code), but I have never heard of anyone putting a dash between the last two words.

raverbashing3y ago

And might I add, Helena does not sound like a Soprano given the speech tone. (Also, smoking is bad mmmmkay)

jkmcf3y ago

A number of books I've been reading aloud leave out the comma after a prepositional phrase (and friends), and it totally throws off my cadence.

bongobingo13y ago

Will we see this level of voice synthesis in the public domain? Maybe I am out of touch but I found those examples very impressive - more impressive than the jobs vs rogan demo a few months back.

But I am also saddened at a future where all this is locked up in corporate hands - obviously there is money needed and (licensed) data needed too which Apple can get at.

Honestly I would rather eschew the ethics of it and just consume any and all voice data (youtube, podcasts, existing audiobooks, radio) that has transcripts available, perhaps because I assume corpos are already doing this, if it means we can have a free and open data model that people can run at home, maybe that makes me evil.

criddell3y ago

> saddened at a future where all this is locked up in corporate hands

I would guess this rolls out from big companies first because the first version is always the most difficult. It’s only going to get easier to do and I would totally expect end-user controlled TTS systems to get better and eventually exceed the capabilities of this version from Apple. Of course Apple isn’t going to sit still, so they will continue to improve as well.

Are there examples from ten or twenty years ago of a technology that big companies had locked up that never made it out to end users? What we have might lag, but it seems like this stuff only ever gets easier to do.

abraxas3y ago

Linux is a prime example of a technology that was at first far behind its proprietary counterparts but eventually dominated and nearly extinguished all non-free competitors.

FinalBriefing3y ago

For very specific uses. Linux is not a good option for general computers used by everyday people. That experience is still owned by large corporations.

wahnfrieden3y ago

as AI requires larger and larger data sets and processing power (energy, money), how will public domain catch up to their wealth accumulation? it's not like linux where chipping away at functionality and incremental UX adds up sufficiently over time

rickdeckard3y ago

My guess is that such "digital narration" is on the brink of becoming available as a service to authors and publishers, with Amazon and Apple trying to get ahead of it by selling a product that can only be published on their platforms then. They are surely able to undercut human narration, but even on digital narration they have an unfair advantage of earning a share on every sale as well.

Will be interesting to see how this develops. Either independent digital narration becomes competitive enough that a publisher simply gets it done once and then sells it on all platforms, or this new platform-exclusive model is so disruptive that it becomes even less economic to produce an Audiobook, effectively making Audiobooks exclusive to Apple and Amazon/Audible (and whoever else has such a digital narration engine).

patentatt3y ago

I'm curious what the economics are here, because getting someone to just read a book can't possibly be that expensive, right? I'm not talking celebrity voice-over work, but just getting anybody with a passable voice to sit down and spend a day or so reading a book into a microphone? Does that really cost that much more than having someone sit down and listen to the whole AI-generated audio book to do QA? And then there's all of the engineers who have to work on the project, the hardware to run all of it, etc. Seems like if it takes months to do and has to be QA'd anyways, it can't be that much more cost effective. Now, if it's completely computer-generated and can be a push-button feature? Sure, that makes sense. I wonder how close they are to that.

cianmm3y ago

You need to take into account that nobody can perfectly read a book first time. I've done a reasonable amount of short-fiction narration for audio magazines, and even with many hours' worth of script reading under my belt I still fumble every third or fourth sentence. So I need to re-read those, and then I (or somebody else) need to find those errors and edit them out, replacing them with the fixed audio. Then people need to listen to the entire file at least once to ensure the whole thing makes sense and I didn't leave out a sentence somewhere or something.

Audiobook narration is one of those things that is remarkably labour intensive, certainly much more than I'd have guessed before getting into it.

hidelooktropic3y ago

If you're talking about using a trained voice actor, yes, they will cost an hourly rate to do this, which is reflective of the care and training they put into their craft. One should also expect that they don't simply press record, and then do an entire read through. They will go back and try different takes on segments.

If we're not talking about a trained voice actor, you may want to look at something like LibriVox, which is entirely volunteer run on public domain works, and while the efforts are appreciated, the quality is noticeably different.

r00fus3y ago

As an Audible customer I can tell you clearly that I often buy audiobooks by a particular narrator (discovering new authors) because the narrator is so good.

I mean, Siri is quite good at reading texts (I imagine that's a huge training corpus) but I think we'll be in "uncanny valley" for quite a few years.

It's possible that the public just gets used to that.

nmfisher3y ago

This is achievable with a very modest (say 50-100k) budget for voice actors and compute. Less if you’re happier with lower quality. Speech synthesis is probably one of the few areas in ML that’s trivially accessible to smaller orgs.

Even Stable Diffusion was only 600k which is hardly outside the reach of a startup. The only ridiculously expensive models reserved for the big end of town are the GPT3 etc language models, and I fully expect the data/compute requirements to come down considerably in the near future.

enlyth3y ago

This is already easily achievable on consumer hardware, you can train something like Tacotron 2 + WaveRNN on your own computer to achieve similar, if not better results. Check out this repo:

https://github.com/coqui-ai/TTS

You can also clone someone's voice by finetuning a pretrained LJSpeech model and training a vocoder from scratch, I've had great success with as little as 15 minutes of speech.

rockemsockem3y ago

It really isn't though. The level of fidelity that Apple is demonstrating in those samples is very impressive. You can generate fine voices with little work using repos like that one, but to get to the level Apple has takes a lot of work.

EDIT: "fine voices" not "find voices"

GordonS3y ago

> You can also clone someone's voice by finetuning a pretrained LJSpeech model and training a vocoder from scratch, I've had great success with as little as 15 minutes of speech.

Are you able to point to any articles to help get started with this please?

enlyth3y ago

Unfortunately, I'm not aware of any beginner friendly tutorials.

The way I learned it was just by experimenting with various GitHub repositories (e.g. https://github.com/fatchord/WaveRNN or the one I linked earlier) but it takes a lot of trial and error. Might do a writeup at some point if I have time.

davidzweig3y ago

Check my other comment in this thread, you might try our dataset. :)

enlyth3y ago

Will check it out, thanks

pmontra3y ago

Not from the companies selling audiobooks IMHO.

They will use this technology to save money on human speakers. If they release it into the public domain we'll end up with ebooks that can read themselves aloud and they'll lose part of the incomes from audio books.

My Samsung phone can read ebooks with one of Samsung's voices right now, but it does an awful job at pauses. Basically, no commas. With a good voice I could turn each one of my ebooks into an audiobook.

raffraffraff3y ago

I don't really think I could listen to an AI voice reading a book. Perhaps some technical stuff, but not fiction. Even the difference between a mediocre vs good voice actor is huge, and can mean the difference between finishing an audio book or stopping after a chapter.

Edit: to be very specific, a really good voice actor will take on different voices depending on which character is speaking, and will act out scenes realistically. I honestly can't imagine any AI being able to do that.

kaba03y ago

There is https://commonvoice.mozilla.org/en though I’m not sure where and how it is being used.

jonathankoren3y ago

Common Voice is more about building a dataset of how people talk, especially with accents.

While there is/was a voice synthesis project at Mozilla it was rudimentary like 3 years ago

carb3y ago

You can play with it on https://uberduck.ai/ and they have a very active Discord!

idle_zealot3y ago

What exactly is "Open Source" about uberduck? It looks like a proprietary tts saas to me; no links to a git repo and the "developer" section just shows how to get an API key and hit their service.

andy_ppp3y ago

"Once your request is submitted, it takes one to two months to process the book and conduct quality checks."

My guess is that these generated voices are far from perfect and someone has to go in and crank the algorithm to get a fair number of passages to not sound strange.

Even in the example Helena there is a word at the end of a sentence that sounds like it should be in the middle and has a bit of weirdness to it. Still, very impressive, I think better than I remember Amazon Polly sounding.

qwerty4561273y ago

Why is it that we still can't have a perfect or near-perfect text-to-speech given all the astonishing advances in ML taking place? Is TTS an area nobody is really interested in or is it harder than generating beautiful pictures and sophisticated writings?

This thing by Apple already sounds way better than the best I heard previously (NextUp Ivona) but it is not an instant-result offline tool yet and that's sad.

potatolicious3y ago

It's an extremely hard problem that lots of people are working on.

The trick is that we have "pretty good" results for TTS as-is, but it has significant shortcomings that are more visible in certain use cases. The operative word is "prosody" - the cadence, rhythm, and pauses that are natural when speaking that are heavily dependent on context and content.

Prosody is incredibly important to making natural utterances - TTS models that do not model prosody end up sounding very "flat", which is mostly all of the heavily used TTS engines out there right now. This is less glaring for short responses like what you would get from a voice assistant, but becomes a huge grating problem when you try to do long-form text reading.

The trick with prosody is that it often requires information and context not contained in the text to be read. You would apply a different rhythm and stresses to a horror story than you would to a conference keynote speech, for example. It also requires a more sophisticated understanding of the content of text rather than simply its constituent words, in order to figure out proper stresses and pauses.

All of this is eminently solvable (as demonstrated here with the book voices) but is... rather difficult. I suspect we're not terribly close to a product where you can just feed it raw text (without annotating or otherwise providing additional data as context) and get a great result.

lilyball3y ago

I wonder how effective it would be to feed the book to some other AI model first that reads the whole thing and figures out the necessary context that it can then go back and feed into the TTS model

davidzweig3y ago

I wanted to make a human-like reading feature for our language-learning software. Training a model isn't too hard using something like https://github.com/coqui-ai/TTS.

The weak link was the available free/open datasets. You needed a single speaker with a pleasant voice, 20hrs+ material from varied sources, recorded in a good recording environment with a good mic etc. For English, the go-to was LJSpeech, which doesn't fulfill all these requirements. I say 'was', as I haven't followed developments recently.

Last year we decided to make our own dataset with an Irish woman, Jenny. She has a soft Irish lilt.

Never got around to training the model, but I will upload the raw audio and prompts here in a few hours (need to pay my internet bill in town..):

https://github.com/dioco-group/jenny-tts-dataset/blob/main/R...

davidzweig3y ago

Added a download link to the readme: https://github.com/dioco-group/jenny-tts-dataset/blob/main/R...

lxgr3y ago

Are visual generative models really that more advanced, or could this simply be an artifact of their usage?

With generative visual art, people usually spend considerable time fine-tuning the results, and we don’t get to see all the prompts that didn’t work out (except if the failure is notable in some way).

Try e.g. illustrating a book, but using only your first prompt for each image. I think the quality would be in the same ballpark as having Siri narrate the corresponding audiobook.

dagmx3y ago

You’re describing the effects of familiarity with a subject.

Stable Diffusion / Midjourney etc look really pretty to the average person but on closer inspection they rarely hold up out of the box. If you’re an experienced artist you pick up on all the flaws right away.

ChatGPT and Copilot are similar. The answers seem confident, but the more familiar you are with the domain of the answer, the quicker you see how flawed the results are.

Now going back to TTS. You’ve spent your whole life knowing what speech sounds like. Unlike those other models that require an extra level of domain knowledge, everyone innately knows the sound of humans speaking. So you’re effectively, and subconsciously, a domain expert.

This is essentially the uncanny valley effect but for other areas.

dahfizz3y ago

ChatGPT and Stable Diffusion aren't perfect. They still produce weird responses or visual artifacts sometimes. But, it can be easy to move past these idiosyncrasies.

I think the brain is just more sensitive to speech, because inflection and tone are a key part of communication. So even subtle artifacts in the generated voice are really obvious and annoying.

Plus, as another commenter mentioned, books are long. An issue in 1 out of 10,000 words will be enough to break immersion.
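The "1 out of 10,000 words" point is easy to put numbers on. A back-of-the-envelope sketch, assuming each word fails independently with the same probability and assuming a 90,000-word novel; both the independence assumption and the book length are illustrative choices, not figures from the thread.

```python
def glitch_stats(word_count: int, error_rate: float):
    """Return (expected_glitches, probability_of_at_least_one_glitch).

    Assumes every word is mispronounced or mis-paced independently with
    probability `error_rate` (a simplification; real errors cluster).
    """
    expected = word_count * error_rate
    # P(no glitch at all) = (1 - p)^n, so P(at least one) = 1 - (1 - p)^n.
    p_at_least_one = 1 - (1 - error_rate) ** word_count
    return expected, p_at_least_one


if __name__ == "__main__":
    expected, p_any = glitch_stats(word_count=90_000, error_rate=1 / 10_000)
    print(f"expected glitches: {expected:.1f}")
    print(f"chance of at least one: {p_any:.4%}")
```

Under those assumptions a listener should expect about nine glitches over the book, and the chance of hearing at least one is better than 99.9%.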

ghaff3y ago

I don't find it easy to look past their idiosyncrasies at all although they can produce impressive results with fiddling and luck.

Listening to these samples, they're still robotic sounding to me just listening for 10 seconds. I can't imagine wanting to listen to a whole book like this given the option of listening to an even modestly-competent voice actor.

JustSomeNobody3y ago

My uneducated opinion on the matter is that we are more tolerant of subtle errors in pictures and writings than we are in sounds. Subtle variations of tone can change the meaning of a conversation that words on paper just can't convey.

qwerty4561273y ago

As a person who has listened to a number of non-fiction books narrated by Microsoft Sam I don't really mind "subtle variations of tone" :-) This Apple thing will already satisfy me if they release it as an offline app for converting plain text files into audio files.

justincormack3y ago

The pictures have weird limbs and the writing has errors. A book is long therefore there will be a lot of issues.

andy_ppp3y ago

Because to understand intonation and rhythm you need to perfectly understand context and emotions. I don’t doubt these things will be added soon enough, so I expect perfect reading by the end of this year and perfect reading in anyone’s voice with a few samples in 2024.

dmitriid3y ago

> Why is it that we still can't have a perfect or near-perfect text-to-speech

Define perfect ;) Two different people will read the same text slightly (or not slightly) differently.

A great example is this brilliant and funny rendition of "To be or not to be" by Tim Minchin, Benedict Cumberbatch, Judi Dench, David Tennant and others. Sorry for the Facebook link, but it's very hard to find this video anywhere: https://www.facebook.com/watch/?v=585252039999241

michaelmior3y ago

> is it harder than generating beautiful pictures and sophisticated writings

I think one difference between pictures and audio is that pictures are two-dimensional and we can't take in the whole image at a time. This makes it easy to overlook flaws without careful inspection. And I find that although there has been some amazing AI-generated art, there are still a lot of rough edges and tweaking required to get really clean images.

As far as writing goes, I suspect that the rules of written language are easier to learn and violations easier to overlook than with generated audio.

romeros3y ago

murf dot ai has near perfect tts. I think we had a major AI breakthrough in the last couple of years

BuckyBeaver3y ago

"Digitally narrated titles are a valuable complement to professionally narrated audiobooks"

Yeah, right. What a lame attempt to deflect the (fully warranted) criticism that this will put audiobook narrators out of work.

rickdeckard3y ago

I think the cat is already out of the bag, and the death of human narration is imminent.

The fight now seems to be whether this transformation happens only in production, or companies like Apple succeed in breaking the total Audiobook price apart into "license" and "production", only buying the license and having the production done on their proprietary servers.

Overall, I agree it's inevitable that this results in a sharp decline in professionally narrated Audiobooks...

amelius3y ago

> Overall, I agree it's inevitable that this results in a sharp decline in professionally narrated Audiobooks...

Or, it will increase demand for audiobooks so much that more humans are needed to create top-notch audio.

rickdeckard3y ago

I don't know how badly narrated Audiobooks can increase the demand for Audiobooks as a whole.

The only scenario I could imagine is a narration language where Audiobooks didn't exist so far for economic reasons (i.e. low population). Digital narration could bring down production costs to the point of making it economic, basically creating the audiobook market for this language.

But then, if the narration is bad (which it likely is because TTS is worse in minor languages), I don't know how many users could be converted to pay a premium for a better human narration. Also here I think it's more likely that funding will be used to improve the narration engine as a whole instead of going back to hiring humans and renting a studio for each book...

rdevsrex3y ago

One of the things I like is when the narrator has to suppress a laugh during a funny passage, or can express a character's anger or frustration.

Until AI is so good that it can mimic emotion, I think there will be a market for human narrators. Of course it will be smaller than what it is now, but I think people will specialize.

rickdeckard3y ago

> One of the things I like is when the narrator has to suppress a laugh during a funny passage, or can express a character's anger or frustration.

I doubt there are big issues for an AI to verbally mimic emotion. Placing emotion correctly in a long narration might be tricky if there are no indicators in the text, but I'm sure there will be a convenient self-service authoring tool where the Author/Publisher can adjust the emotion with a slider if he wants to finetune the result...

> Until AI is so good that it can mimic emotion, I think there will be a market for human narrators. Of course it will be smaller than what it is now, but I think people will specialize.

A smaller market means higher cost per-unit, so higher prices per Audiobook. If the publisher needs to meet a specific price (i.e. to be listed on flat-fee audiobook-portals) he might be forced to produce digital narration as a default, which means the market for an additional "premium human narration" will have to prove itself first.

I doubt that such a bar will be reached in most cases. It's more likely that people complaining about bad narration will put pressure on AI-engines to improve, but not form a market where a critical mass will pay an additional $20 for human narration...

BuckyBeaver3y ago

They won't be able to. Not enough people will reject computer-generated narration and insist on real narrators for them to remain employed.

prepend3y ago

Sad for the narrators, but good for the world.

There are so many books I have that don’t have an audiobook version because the economics just aren’t there.

This is an easy way that technology can expand human experience.

Even in situations where the author reads the book, I expect it will be cheaper to train an AI to sound like the author than to put the author in a studio for 50 hours (or whatever).

I thought it was a really dumb ruling when Amazon was forced to remove the text to speech function from kindle.

I also think that screen readers are hobbled to avoid this legal issue. I want to send any text through a narrator bot and have it read it to me. There is zero need to compensate anyone other than the developer who writes the AI (and hopefully it will have open source versions donated by developers).

If I’ve bought a book, I should be able to use it as I like.

elil173y ago

To be fair, a lot of narrators are really not doing that good of a job. I frequently hear audiobooks that have been rushed through production - mispronounced words, strange cadence, and overacting. I'll take what I heard in that demo over ACX crap any day.

The current iteration of this technology is not competing with truly great narrators, like Tom Hanks or Jim Dale.

tinus_hn3y ago

This is as valid a criticism as complaining the phonograph will put musicians out of work. Time marches on, adapt.

acdha3y ago

I'm mixed on that because while I appreciate the craft of a professional narrator, I support a group of users who are mostly blind and there's a constant tradeoff between availability and the quality of an audio book. People value good recordings – people often have favorite narrators and will select books based on that, sometimes even outside their normal interests (which wasn't something I'd previously appreciated) – but if it's something you want to read, having it now versus a year from now matters.

abraxas3y ago

Some narrators never should have had their jobs to begin with. I'm viscerally angry at the narrator of "Permutation City" on Audible. Such a great book that could not have a more bored, disinterested narrator who clearly doesn't understand the text he's narrating.

An AI TTS engine at this level would do a far better job of it than that particular dude.

339559853y ago

Huh? The professionally narrated audiobooks don’t get memory hole’d from the earth because Apple announced this service. The sentence you quoted is intended to emphasize that professionally narrated audiobooks will continue to be available on the platform.

karmasimida3y ago

Can they just train a model to narrate for them and fix the sections where the model makes mistakes?

TBH, human narrators on Audible sometimes just read the stuff aloud.

339559853y ago

So happy to hear Apple putting African American voices front and center for this initiative. Along with Google’s push to make camera lenses / computational photography more accurate for darker skin (and Apple following suit) this feels like a real step forward for inclusion.

drexlspivey3y ago

Voices have a race?

339559853y ago

Speakers of African American English have distinct timbre, rhythm, and cadences in their speech. There has long been a lack of these distinct features in TTS. Apple appears to have added a “Black voice” (though to be clear, there are speakers of African American English who are not Black) in 2021:

https://www.consumerreports.org/digital-assistants/apples-ne...

vortegne3y ago

Often there are clear voice and accent differences. Are you saying there aren't? That would be a very strange statement.

mhuffman3y ago

Not so strange! ... at least when it comes to the legal community[1] (pdf).

For those around during the O.J. Simpson trial, this was a very, very contentious topic during the trial!

[1]https://cpb-us-w2.wpmucdn.com/sites.wustl.edu/dist/3/2151/fi...

drexlspivey3y ago

Accent depends on where you were born/raised, not on your race.

tchalla3y ago

Are you saying that voices across all races, cultures etc sound the same?

drexlspivey3y ago

No, it depends on the culture. Both whites and blacks raised in East London will have similar accents for example.

Since we are strawmanning, are you saying that cultures and races have a 1-to-1 relationship?

troupe3y ago

I seriously doubt you can tell someone's genetics (their race) based on their voice. On the other hand, you can tell a huge amount about the culture in which they were raised.

theshrike793y ago

The OG Kindle with keyboard used to have text to speech too.

It was killed by publishers who wanted to charge separately for audiobooks.

If Apple has somehow managed to get the licensing for this, I might consider buying from Apple Books in the future.

rickdeckard3y ago

The product is actually directed at authors, offering them to have an Audiobook produced which is "digitally narrated by Apple Books".

The Author still needs to hold the rights for Audiobook production, and he needs to license a third party to produce an Audiobook (no matter if human or "digitally" narrated).

I guess that's why this is aimed at "independent Authors", to circumvent negotiating Apple's rev.share and exclusivity for that production with established publishers...

shrx3y ago

> It was killed by publishers who wanted to charge separately for audiobooks.

Any sources on this?

bentley3y ago

Amazon decides Kindle speech isn’t worth copyright fight (2009) — https://arstechnica.com/gadgets/2009/03/amazon-backs-off-on-...

See also the recent lawsuit covering the other direction, automatic transcription of Audible books. https://www.geekwire.com/2020/amazon-owned-audible-major-pub...

shrx3y ago

Thanks. I never really used Kindle Speech on my 3G Kindle, but was curious why it was suddenly gone in later versions.

carlob3y ago

You are never buying from Apple Books, it's the usual DRMed crap, more like a rental. Amazon had gotten a lot of flak, but Apple is not any better in this respect...

ezfe3y ago

If DRM works offline, it's not a rental. It's not desirable, but don't call it a rental, that just moves focus away from what the real problems are here.

TheCoelacanth3y ago

Even if it works offline, it probably won't continue to work if you need to switch to a new device after their DRM servers are turned off.

DRMed content can never truly be purchased.

criddell3y ago

Most Apple and Amazon Books are DRM encumbered, but not all. AFAIK, there isn’t a way to tell before you buy the book except by choosing books from publishers that don’t use DRM on any of their titles.

2Gkashmiri3y ago

licensing as in?

theshrike793y ago

The license/copyright for the written form of a book is different from the one for the read form.

Author might sell the book rights to company X and audiobook rights to company Y. Company X can't do a text to speech version of their book without infringing on Y. Y can't do speech to text of their version without angering X.

Licenses are fun!

alanwreath3y ago

I guess I kind of wish they would just offer the AI narration as a feature of Apple ebooks. Such that, if you buy the book, you can have ebooks read to you by your phone. I am really just buying books off audible with the subscription they offer. There are some books (tech books that is) that are offered as audio books and I gobble those up. There are, however, many more epub/digital books that I never buy not because I'm uninterested in the content but only because I don't have the time to sit down. I assume that for said books the audience isn't large enough (and may never be) to merit anyone ever recording the audiobook.

There are certain books that I think I'll always buy the non-AI variant because narrators can bring more than natural reading, they sometimes bring different characters (sometimes more feminine, more baritone, more stereotypical accents) -- and I would melt if AI could do that kind of voice acting.

cjensen3y ago

Amazon at one time tried to add voice reading to Kindle books. Authors were absolutely livid. Audiobooks are a significant income source, and taking that away from authors is going to make authors decline to sell digital books on your platform. Apple is doing this right by making it an author's choice.

alanwreath3y ago

I totally see this point — I’m making a separate one for ebooks that aren’t getting purchased because they haven’t been (and probably never will be) narrated by a real human.

cjensen3y ago

I agree with you. Apple making this an author choice avoids some authors being angry while enabling more sales for lower-volume books that, as you point out, will otherwise not have a spoken version.

napier3y ago

Yeah that would be cool and in a free-er market more friendly to healthily competitive innovation diversity, we'd already have natural-sounding narration built into every browser and reader; which would be an accessibility UX boon. But the publisher oligopoly wouldn't stand for it and there's not really much of an incentive for the marketplace monopsony-monopoly janus/jani to bake it into their products for free or a flat fee, even if the lawsuits from rights-holders standing to lose out on audiobook sales would be worth swatting away.

layer83y ago

This should be a feature available for any text document. The existing iOS text-to-speech is almost barely adequate, but not really.

Elof3y ago

I use the iOS Speech Accessibility feature to listen to ebooks and it works great.

alanwreath3y ago

that is a good feature, it just seemed like the reading I'm hearing off of the samples for these audiobooks is a tad less robotic.

ghaff3y ago

I'm skeptical given the state of the art.

There is way more good audio content out there than I have the time/interest to listen to and I can't believe I'm that atypical. And a book is a relatively big listening time commitment. I'll happily pay a few dollars more for a good human narrator.

falcolas3y ago

A couple of comments from a narrator who's worked through ACX before.

First, the last few years have seen a race to the bottom for narrator rates, since during the pandemic it was recognized that it's a job that can be easily done from home, literally from anywhere in the world.

Accordingly, the up-front cost for an average quality 10 hour book is only about $1,500, and can be turned around in under two weeks from a human. If you get a really good and well known narrator, it's still only about $4,000 (and you'll probably get it quicker).

Also, they're going to be competing against revenue share models from Amazon/Audible, which basically means it costs the author nothing up front. Amazon's bite out of audiobook sales is absurdly high (60%), so other companies could (and are) definitely improve on that. It's mostly a fight against Audible's brand at this point.

But back to AI: AI narration is going to have to compete against humans willing to do a lot of work for very little pay. I'm honestly not sure the compute and QA costs will be competitive. And frankly, even if it is cheaper, it's not as if those savings will be passed back to the customer.

If you'd like to look at how little it can cost to get a human to do voiceover work, check out fiverr.com and look for voice actors and narrators.

ghaff3y ago

Thanks for the insights.

That doesn't really surprise me. On the flip side, I can get high quality transcriptions for $1/minute (given good audio quality).

People, even those with better than average talent at some things, just often aren't that expensive. I suspect the same is true for some of the generative AI tasks that people are all excited about--new grad English majors are pretty cheap, especially if they can be assisted by search/generative AI.

sigmar3y ago

Fully agree with this. I could understand TTS for quickly converting articles to audio (and of course for visually impaired ppl), but for books the current state of this tech doesn't interest me. The qualities I want from a good narrator aren't in these samples (correct emphasis within a sentence, variable pacing dependent on context). For fiction books, good narrators will change timbre and accents depending on who is speaking in the text, not clear if they tried to achieve this at all (could have potential to use a different digital voice entirely).

I hope that the results from this type of production are clearly labeled as computer generated in the store. I don't think putting "AB Apple Books" is clear or sufficient, for someone that doesn't know about this tech "AB" sort of looks like a placeholder for some unnamed human.

rockemsockem3y ago

I tend to agree for the current product that Apple is releasing. IMO this technology starts to get interesting for books once folks can generate audiobooks for titles that do not have an audiobook (and likely never will due to publisher disinterest). When I first got into audiobooks I wanted to go back and listen to one of my favorite books and it wasn't available :/. I also see certain audiobooks described as "unlistenable" because of something the reader does.

troupe3y ago

I often find books I want to read that don't have audio versions or the audio version is for a different translation than what I want to read. So if you are looking for specific things to read the (eventual) use of this type of technology to open up some of those in audio format seems useful.

(But totally agree with you that this isn't going to replace a good human narrator.)

rickdeckard3y ago

Indeed. But this option will only be available if a critical mass is also willing to pay a few dollars more for human narration.

In times of flat-fee Audiobook platforms the pressure to bring down audiobook production costs will only increase, funding a full-fledged Audiobook production for each book will only become harder to justify.

Moreover, looking at what Apple describes here, they seemingly want to establish digital narration (quality) as a metric for competition between Audiobook marketplaces, not publishers. So if this works out, the major platforms will compete on digital narration and publishers will have less incentive to actually produce an Audiobook with human narrators...

ghaff3y ago

That's fair and it's true of a lot of AI/ML versions of content. I still paid for human transcriptions of podcasts when I was doing them because the time needed to clean up the ML versions just wasn't a good return. But the day will certainly come when that calculus changes.

I know nothing about the economics of audiobooks. And will note that there are free public domain audio books already https://librivox.org/. But TTS will improve and, at a minimum, improved TTS will be a benefit for people who can't read for various reasons.

rickdeckard3y ago

Well, an intent of Apple seems to be to break the price of an Audiobook into license and production cost, take control of the production using AI and pay only the publishing license, instead of having to buy the rights to sell an Audiobook as a separate work of Art (because in the end, their engine will create the work of Art from the written word).

Sadly I don't see how this will make Audiobooks any better than human narration could. It's more about streaming platforms taking more control over the content and having experienced people train their proprietary TTS engine along the way.

Just to avoid confusion on Librivox: They offer Audiobooks of works which are already in the public domain (so not only the Audiobook is in public domain, also the rights for the book have already expired). So it's a platform allowing people to make free narration of already-free content.

thomasahle3y ago

I have a bunch of books on my "to read" list, that still don't have a narration. I would happily listen to an AI version as an alternative.

ivansavz3y ago

Or you could use the command `say` on the command line on any current mac to get good-enough text-to-speech.

See full script here: https://gist.github.com/ivanistheone/de3ccb244224d101bb93320... and this doc explains how you can setup a keyboard shortcut to turn any text selection into an audio book https://docs.google.com/document/d/1mApa60zJA8rgEm6T6GF0yIem...

Here is a sample if you want to hear what it sounds like: https://minireference.com/static/tmp/constructive_feedback.m...

which is the audio from this blog post https://productivityhub.org/2019/04/19/how-to-deliver-constr...

IMHO, the computer generated voice like Alex (the default voice on macOS) sounds better because it doesn't try to do inflections or add human character when it is reading. The real-world narrators (voice actors) seem to add too much "character" into their reading, which distracts me from the story/content. The only exception is when the narration is done by the author, in which case I'd consider the narration as part of the work.
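
For anyone who wants to try the `say` route without the full script, here is a minimal sketch of turning a text file into an audio file. It only assumes the standard `say` options `-v` (voice), `-f` (input file) and `-o` (output file); the function names `audio_name` and `narrate` are made up for illustration and are not taken from the linked gist.

```shell
#!/usr/bin/env bash
# Sketch: turn a plain-text file into an audio file with macOS `say`.
# Function names and the default voice are illustrative assumptions.

# Derive the output name from the input name: chapter1.txt -> chapter1.aiff
audio_name() {
  printf '%s.aiff\n' "${1%.*}"
}

# Read a text file aloud into an AIFF file (macOS only).
# $1 = input text file, $2 = optional voice name (defaults to Alex).
narrate() {
  local input="$1" voice="${2:-Alex}"
  if ! command -v say >/dev/null 2>&1; then
    echo "say not found: this sketch needs macOS" >&2
    return 1
  fi
  say -v "$voice" -f "$input" -o "$(audio_name "$input")"
}

# Usage: narrate chapter1.txt        # writes chapter1.aiff next to the input
```

The voice has to be installed on the machine; `say -v '?'` lists what is available.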

scinerio3y ago

I personally find that lack of character and inflections has completely turned me off of audiobooks in favor of podcasting. The typical monotone audio narration causes me to zone out into other thoughts and I find myself rewinding or just turning it off.

ivansavz3y ago

I've experienced that too, but only for "bad writing."

I'm normally able to follow narrative (both fiction and non-fiction) that has something to teach, and also enjoy listening to classic literature no problem...

But sometimes I'm reading a long article from the internet and I experience what you describe (losing track of what author is saying, having to rewind to get the point). After a while, I realize it's not the computer's fault, but the article is just very low content (e.g. some authors just pile on words, emotions, opinions without a coherent narrative or point). Recently I noticed I'm able to detect GPT-generated text this way too... words without content or message.

Perhaps the monotone TTS can be a test for the "meaning" contents of a text.

rockemsockem3y ago

If you're still interested, give graphic audio a try. They're full-cast (usually a different reader for each character) high production quality audiobooks. They cost accordingly too though.

https://www.graphicaudio.net/

hiidrew3y ago

TIL. This is an interesting capability of the command line. Have any more fun ones? (at least fun to a CL noob)

ivansavz3y ago

Here is another script `getmp3.sh` that you can use to download .mp3 file from any youtube music video:

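   #!/usr/bin/env bash
   # Usage: getmp3.sh <video-url>
   [ -n "$1" ] || { echo "Usage: $0 <video-url>" >&2; exit 1; }
   echo "Downloading mp3 from $1"
   # -x extracts the audio track; --audio-format mp3 converts it to mp3
   yt-dlp -x --audio-format mp3 "$1"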

You'll need to install https://github.com/yt-dlp/yt-dlp#installation before you can use that. As you can see, the "script" just adds the options `-x` (extract audio) and `--audio-format mp3` (convert to mp3 at the end).
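
If you have several videos, the same two options can be wrapped in a loop. A rough sketch, not a polished tool: `get_all_mp3` and the `DRY_RUN` variable are names invented here (they are not part of yt-dlp), and the dry-run branch just prints the commands so you can check them first.

```shell
#!/usr/bin/env bash
# Sketch: run the getmp3.sh command for every URL given as an argument.
# get_all_mp3 and DRY_RUN are hypothetical names used for illustration.

get_all_mp3() {
  for url in "$@"; do
    if [ -n "$DRY_RUN" ]; then
      # Only show what would be run.
      echo "yt-dlp -x --audio-format mp3 $url"
    else
      # Stop at the first download that fails.
      yt-dlp -x --audio-format mp3 "$url" || return 1
    fi
  done
}

# Usage:
#   DRY_RUN=1 get_all_mp3 URL1 URL2   # print the commands
#   get_all_mp3 URL1 URL2             # actually download
```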

macintux3y ago

I haven't figured out how to effectively search my HN favorites, else I'd probably be able to find a few more of these, but this was discussed recently:

https://git.herrbischoff.com/awesome-macos-command-line/abou...

nstart3y ago

I'm not sure how I feel about the quality of this. It... drones. The samples are really bad. It's not that the voices sound robotic. The reading is boring. If this was tested on me without prior knowledge, I'd say "not sure if human or not. But it's a bad reading either way".

Edit: Adding a few more details to my thoughts to say why it's boring. Good narration is so much more than correct pauses. Pacing. Emotion around words like death and life. Ensuring that sentences don't repeatedly end on the same inflection tone. Modulation of rhythm. None of that is there.

The last time I ran into this was when someone I know started a YouTube channel where they put together the script and the video and then used an AI to narrate the script. I assumed it was an AI because I figured that's how said acquaintance would have managed the budget. But it was incredibly tedious to listen to. You can see this at work here (https://www.youtube.com/watch?v=yWVvmKpCBDg). Has the same feel of the Apple digital narration. I don't know how I could listen to that easily for over an hour.

aeneasmackenzie3y ago

I have listened to a fair amount of fan fiction read aloud with what seems to be the default Siri voice in the fanfiction.net app. Like watching something with subtitles, you don't hear the drone after a while. It does put a lot more emphasis on the quality of the writing, which can be rough with fanfic.

walterbell3y ago

Good to see mainstream accessibility work on high-quality text to speech.

On iOS/macOS, VoiceDream has offered flexible apps with voices in multiple languages and accents since 2012, e.g. for reading PDFs, web, non-DRM ePub books and scanned text, https://www.voicedream.com/about/.

TacWitch3y ago

Mitchell sounds exactly like Ray Porter. I wonder if he trained a model with them or they did it without his direct approval

Maursault3y ago

> Mitchell sounds exactly like Ray Porter.

"Mitchell sounds like Ray Porter" is more accurate. Accent is completely different, so they don't sound exactly alike. My first impression was that Mitchell sounds like Clay Jenkinson,[1] but more a cross between Jenkinson and a male newscaster I can't place who is probably retired now, but who also narrated documentaries.

[1] https://www.youtube.com/watch?v=d8UoL0AOL3k&t=1m49s

jjcm3y ago

I reached out to him on Twitter [0] asking exactly this.

According to a Reddit comment [1] it is, but they haven’t posted their source.

[0] https://twitter.com/pwnies/status/1610857711008370688?s=46&t...

[1] https://reddit.com/r/apple/comments/103iogu/_/j305eby/?conte...

CubsFan10603y ago

I don’t think you got the correct Ray Porter.

Correct one is https://twitter.com/Ray__Porter

jjcm3y ago

Doh. Thank you - didn't realize it was the double underscore.

tartuffe783y ago

Yea I immediately recognized him, hope he's getting paid for this.

urbandw311er3y ago

He likely won’t be able to comment without breaching an NDA — my recollection is that the guy who voiced the original SIRI got in all sorts of trouble for trying to capitalise on it.

SurgeArrest3y ago

Would love to see one day AI used to read annotated text with multiple voices, so each person in a novel gets his/her voice and also narrative voice. Would be epic and actually better than most audio books read by a single person attempting to pretend to speak in different voices.

Was always frustrated that Kindle was barred from reading books, it is such a natural progression of capabilities. Leave it up to the buyer to decide if they want to pay for the person, but default TTS should be allowed for all books, so that I can read a book at home and then continue listening during a walk.

habosa3y ago

Whoah the Madison voice sounds _exactly_ like Julia Whelan, who is a real audiobook narrator. I have listened to many articles on Audm (narrated news articles) using her voice. I wonder if she had a part in this?

southp4w3y ago

The Mitchell voice also sounds almost exactly like Ray Porter, another real audiobook narrator.

fuzzywalrus3y ago

I heard that and it was uncanny how much it sounded like Ray Porter to me.

urbandw311er3y ago

If she did she’ll likely be NDA’d up to her neck to prevent her ever admitting it publicly.

drexlspivey3y ago

I’m very interested to see if/how the model can figure out to produce a different voice when a character is speaking and how to keep the same voice for each character across the whole book consistent. Especially the second problem is not trivial at all from my understanding of how neural networks work.

mgh23y ago

Samples from https://news.ycombinator.com/item?id=34253424

Male: https://books.apple.com/gb/audiobook/pale-moon-rising/id1640...

Female: https://books.apple.com/gb/audiobook/shelter-from-the-storm/...

jasonjmcghee3y ago

The male sample has many pauses that are distractingly long. Pretty interesting.

mark_l_watson3y ago

While TTS has broad application, I am skeptical about Apple’s process being able to compete with the best narrators.

I have my biases. My wife and I have licenses to listen to about 500 Audible audio books and in the best of them I feel like I have a human to human relationship with the narrator that is similar to a relationship with the author.

I have mostly worked on deep learning projects over the last eight years, so I appreciate the tech as an engineering tool, but I think it is important to view tech as a servant to human experience.

RegnisGnaw3y ago

Not every book can afford the best narrators. A good one charges somewhere in the ballpark of $300-500 a finished hour. So for an average novel that's like $3000-5000. Not all writers can afford that, so this is a cheaper alternative.

It's like a Lexus vs. a Hyundai.
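
To make the arithmetic explicit: the $3000-5000 range follows from the per-finished-hour rate if the novel runs about 10 finished hours. That length is an assumption implied by the numbers above, not something stated outright, and `narration_cost` is just an illustrative helper.

```shell
#!/usr/bin/env bash
# Sketch: up-front narration cost = rate per finished hour x finished hours.
# The 10-hour book length is an assumption implied by the quoted figures.

# $1 = rate in dollars per finished hour, $2 = finished hours
narration_cost() {
  local rate_per_hour="$1" finished_hours="$2"
  echo $(( rate_per_hour * finished_hours ))
}

hours=10
echo "Low end:  \$$(narration_cost 300 "$hours")"   # 300 * 10
echo "High end: \$$(narration_cost 500 "$hours")"   # 500 * 10
```

Plug in a different length (say 15 hours for a longer book) to see how quickly the bill grows.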

macintux3y ago

Best narrators? Agreed. But as someone who did some recording for a local radio station years ago, it’s an incredibly time-intensive project to record a book.

mensetmanusman3y ago

Can it pronounce “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo”?

https://en.m.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buff...

banana_giraffe3y ago

The biggest failing for me:

They don't appear to be making any attempt to have the narration use inflections for different characters. This is probably fine for nonfiction books, but for fiction books, it can make it really hard to follow when a narrator does this, at least for me.

visitect3y ago

I understand this sentiment. I've been an audible subscriber since 2002 and have listened to hundreds of audiobooks, mostly fiction. The ability of the narrator to provide distinct, interesting voices for each character figures prominently into my enjoyment of a book. This technology sounds fantastic, and will likely enable pleasant narration of texts that would otherwise never have it, but I don't think it's likely to replace professional narrators for a large number of cases.

jdmoreira3y ago

In 2019 I did an "Ask HN: When will text-to-speech replace narrators"

Most answers did not age well I would say.

https://news.ycombinator.com/item?id=20931541

phphphphp3y ago

The answers seem to have aged quite well! Apple specifically say that this is designed as a complement to human narration and not a replacement: some aspect of that is Apple protecting relationships with the audiobook industry, but it’s also fair to say that the humanness of human narration is still unmatched by text-to-speech. For many people, high quality text-to-speech will be good enough to enjoy but it doesn’t seem likely to be an audiobook replacement today.

That said, I suspect it’s less of a technology issue and more cultural: the current generation growing up on robot audio will have different expectations to previous generations, the lack of humanness probably isn’t an issue for them, so even if the technology can never exactly replicate the humanness of audiobooks, it may not matter.

(The text-to-speech on TikTok, for example, is often lampooned by people not of the TikTok generation for being disconcerting and annoying, whereas young people seem to have no problem with it… and that voice is much more artificial sounding than these Apple examples).

phphphphp3y ago

Also as far as predictions go, I predict that a few years from now there’ll be research and think pieces about the impact learning from artificial voices has on young children.

lern_too_spel3y ago

If this is your standard, the answer was 2020 or maybe earlier. https://news.ycombinator.com/item?id=34271329

I don't think publishers will stop hiring professional narrators for the bestsellers (instead of the long tail that has been text-only until recently) for a few years more.

rahoulb3y ago

I have to say that a number of audiobooks I've bought recently have been totally spoilt by the narration - and the AI voices on offer here all sound better than the humans involved in those books.

I also remember reading, a couple of years ago, that Apple was working on improving the voices for Siri - resulting in me thinking "surely Apple has more important things to work on to improve Siri". I guess this was what they were actually aiming for.

Dig1t3y ago

It's so true! It's such a tragedy when a really good book is ruined by bad narration. All of the Hitchhiker's Guide books (except for the first one, which is masterfully read by Stephen Fry) are a good example, but there are tons of other examples. There's usually only one version as well so you can't even shop around for a version with a better narrator either.

rahoulb3y ago

If you can find it, look for the BBC Radio version of Hitchhikers, from the late 70s, early 80s. Each version has a slightly different variation on the story and the radio one is my favourite of the lot.

My friend had a copy of the scripts from the radio series and we used to use it as inspiration - "open the scripts at a random page and our band name will be the first thing that Marvin says ... Zootlewurdle - perfect!"

I've got them on a set of CDs somewhere, but have nothing that will play them...

dirtyid3y ago

I've learned to enjoy particularly robotic TTS narrations because after a while it becomes associated with my internal voice, and it feels like I'm "reading" the book versus elaborately produced audiobooks that are experientially closer to enjoying a show. And I feel less hesitation about listening at 3x speed. With practice, I can start layering on personality to different characters with neutral TTS voices.

mcphage3y ago

> I guess this was what they were actually aiming for.

I would imagine it was the other direction—that this is a way for them to test out their improvements to Siri.

kumarvvr3y ago

Is it just a coincidence that only a few days back I saw an article or video about how low a payment Audible makes to its authors?

It was a famous author too.

And now this announcement.

Ruthalas3y ago

Likely referring to Brandon Sanderson's recent comments[0].

Edit: Previous discussion on HN: [1]

[0] https://winteriscoming.net/2022/12/30/brandon-sanderson-blas...

[1] https://news.ycombinator.com/item?id=34104204

dpratt3y ago

“Mitchell” is clearly Ray Porter, an absolutely phenomenal voice actor/narrator. He’s done a range of audiobooks across many genres, and anything he does is a pleasure to listen to.

I sure hope that he negotiated a gigantic amount for his data/training set provided to Apple, as this tech sounds like it’s getting advanced enough to obviate a giant chunk of the narration business overnight.

southp4w3y ago

Just replied this to someone else. Instantly recognizable as Ray! Phenomenal narrator. He and Nick Podehl are my favorites

flakiness3y ago

I hope Amazon does this too but probably they won't because it'll cannibalize Audible. This is a good move from Apple. Take the credit!

WalterBright3y ago

The first versions of Amazon's Kindle did this. Then they got mired in lawsuits over it from the book industry.

gdcbe3y ago

Why the law suits. I would say: offer it with a big warning that this is automated and might be bad quality. Those books that have narrated versions can come with a buy extra button to have a professionally narrated version. For which I’m probably more than happy to pay. As an avid reader with 3 kids it’s nice to be able to switch between audio and text book, depending on my availability and context. As such, if no professionally narrated version is available, an automated one is way better than nothing.

Also, having the progress sync between digital text book and audio version is a great UX improvement!

potatolicious3y ago

> Why the law suits.

Because the contracts between Amazon and the publishers do not permit this kind of work.

Like, this is (legally) a pretty open and shut issue. Amazon has a license from the publisher for the book that permits a certain range of activities that are contemplated by the contract: showing short excerpts for marketing for example.

IP licenses are carefully constructed, often because the rights are sold to different parties. You may for example license a book to adapt into a movie, but the contract would likely forbid you from adapting it for a TV show. The publisher may sell those two rights separately to two different parties.

And these contracts likely either specifically forbid constructing an audiobook (automated or otherwise) from the original book, or at least do not contemplate it. That is a clear source of lawsuits where Amazon is likely clearly in the wrong.

Another reply here mentions that the publishers do not "like" that Amazon went ahead and did this without consultation. That may very well be true - but more importantly (and constructively) it's likely that Amazon's behavior is specifically forbidden in the contract they voluntarily signed on to with the publisher.

rockemsockem3y ago

The lawsuits were because Amazon just did it, without any author/publisher buy in. They just said "you can convert this ebook using our TTS with the kindle" and the publishers did *not* like that. IMO that is how this should work, any text anywhere should be convertible to audio.

gdcbe3y ago

It’s just making it more difficult anyway, right? Ebooks can be converted using an external program, many of which have easy-to-use GUIs. But I guess it’s as far as they can make use of liability laws.

hintymad3y ago

I had the same question, but Apple seems to offer the narration to authors so they can choose whether a book comes with text to speech.

thedailymail3y ago

I'm curious about the thinking behind "mysteries and thrillers, and science fiction and fantasy are not currently supported." Is it because these genres use more non-standard words? Or is there more risk of the Apple voice being used to narrate something inappropriate?

londons_explore3y ago

I'm very surprised they make this a feature for authors, rather than a feature for users.

As a user-focussed feature, it could read any ebook out loud, and would differentiate Apple Books from any other audiobook platform.

I guess it's aimed at authors, because then they can charge the author for the 'narration' service....

dmitriid3y ago

> I'm very surprised they make this a feature for authors, rather than a feature for users.

Licensing. Audiobooks are a different license from ebooks, and trying to narrate an audiobook will infringe the licensing terms.

londons_explore3y ago

Apple is big enough they could just tell authors "here is our new narration feature for users. If you don't like it, pull your books off our platform.".

No author is going to win a twitter flame war because they don't want a 'speak it out loud' button provided by apple on their ebooks.

Besides - all ebooks on Apple platforms already support this via:

Settings → General → Accessibility → VoiceOver → ON turns on the VoiceOver feature

This is just a higher quality version of the same.

hnbad3y ago

Your view is a bit myopic as you seem to assume Apple can just throw its weight at it and that winning in the US market would be the same as winning globally.

Books are one of the oldest media around and as a consequence most jurisdictions have fairly extensive and specific laws around them and their authors' and publishers' rights. In many cases copyright itself is ultimately based on laws created to deal with authors and publishers.

Infamously, Amazon tried to snub book pricing laws and lost. Google got into hot water with newspaper publishers because its news app violated laws originally written for citing physical newspapers.

This is like suggesting Spotify just ignore the RIAA or Netflix should just stream all content in all countries, licensing restrictions be damned.

dmitriid3y ago

> Apple is big enough they could just tell authors "here is our new narration feature for users. If you don't like it, pull your books off our platform.".

You assume that it is authors who sell books on Apple platform (or on any platform).

Let me introduce you to a couple of chunky boys:

- Penguin Random House https://en.wikipedia.org/wiki/Penguin_Random_House

- HarperCollins https://en.wikipedia.org/wiki/HarperCollins

- Simon & Schuster https://en.wikipedia.org/wiki/Simon_%26_Schuster

Veen3y ago

It’s not big enough to tell the major publishers that. Especially if it wants mainstream titles on its relatively small ebook platform.

339559853y ago

Apple is big enough to have the EU dictate what port to ship their phones with, too. The regulatory landscape requires careful navigating.

rockemsockem3y ago

Apple would absolutely get sued by publishers, but, I don't think that Apple providing a high quality narration tool with their phone which can be used on any ebook would infringe licensing terms. It's not like they'd be saying "Here is this specific title for you to read/buy/rent", they'd be releasing a tool with the power to do that.

dmitriid3y ago

> I don't think that Apple providing a high quality narration tool with their phone which can be used on any ebook would infringe licensing terms.

Oh, it definitely would. This produces a derivative work besides anything else.

> It's not like they'd be saying "Here is this specific title for you to read/buy/rent", they'd be releasing a tool with the power to do that.

That's exactly what they will be doing from the point of view of copyright law.

Hakeemmidan3y ago

I regularly use my MacBook for narration. I look forward to this being better-adapted for books.

lvl1023y ago

I am surprised it’s taking so long for big tech companies to roll this out. I suppose there’s too much money in audiobooks? Perhaps licensing issues? If anyone can get a “celebrity reads aloud” feature that would be Apple and that could be big.

rickdeckard3y ago

Yes, licensing. Narrated books are a separate publishing license, with its own production-cost model. So a tech.company like Apple already having rights to sell the written words is still not licensed to publish a narrated version. To sell an Audiobook, they have to acquire the license for a separate product, which so far includes the produced Audiobook itself.

It seems Apple is trying to get the audiobook license directly from those authors who didn't sign the license away yet, undercutting production cost for the Audiobook with "digital narration" and then earning more money per sale...

I guess we're going to see human narration die very fast now, at least for some common languages, and tech.companies want to ensure that they can split license from production cost, instead of being forced to buy the "whole" Audiobook...

karmasimida3y ago

TBH it will depend on how those narrations turn out to be.

Cherry-picking is easy, but I paid for this, it needs to be human quality throughout.

rickdeckard3y ago

True, but I expect the critical mass of the target group will remain to be people frequently listening to Audiobooks --> Those people are more-likely subscribed to an Audiobook service --> are not paying a per-audiobook price but a flat-fee --> The flat-fee for the whole catalog is lower than buying one audiobook.

I would expect this target-group to access a mix of human and digitally narrated books during the transition to digital narration, with best-selling books still being narrated by humans. Users may then complain about the quality of the digital narration, but will keep using such services as the price-expectation is now set.

--> Competition for better digital narration engines will likely drive evolution of the engine and authoring tools, further increasing the pressure on publishers to justify the bottom-line of per-book Audiobook production costs.

> Cherry-picking is easy, but I paid for this, it needs to be human quality throughout

That's a really interesting aspect. If the Audiobook delivers the content with a human voice but still not engaging enough, how many listeners would put the blame on the narration rather than the book itself... ("I like this new song of Metallica, but I don't like how they sing it")

kzrdude3y ago

The quality needs to be very high for it not to be jarring. I stay with podcasts if the host has a "good radio voice". ("Not even all human voices are good enough for me.") It's just a very intimate medium to have someone's voice right in your head. If the voices have annoying quirks, those audiobooks will not be loved.

Yes, it needs to be tens of hours of perfectly good narration.

sidibe3y ago

I have used TTS for books for years. It might be jarring the first time but you get used to it very quickly, I don't even think about it. I actually prefer it to my Audible books usually because it doesn't ever do anything to annoy me like some narrators, and I can understand it at whatever speeds. There are some dialogue heavy books where I have to read along though to be sure of who is talking.

lvl1023y ago

I think you’re underestimating state of the art in this area. You can do amazing things with just a few minutes of readings.

rockemsockem3y ago

No. You are vastly overestimating it. There is a reason there is no broadly available TTS service like there is for text-to-image. Anyone who says you can clone a voice in a few minutes is not talking about human-quality.

hxugufjfjf3y ago

I cannot put into words how much I want Silmarillion read by Scarlett Johansson.

abudabi1233y ago

I want on Spotify early faery tales and parables of dragon-slayers and dragons as read aloud by Joe Rogan, Michael Bisping, Tom Aspinal to kids of the lower political-economy.

barelysapient3y ago

It’s the spaces between sentences and between pauses that need work. Usually a reader will take a breath or finish exhaling. Instead Apple’s audio drops to 0 db. It sounds unnatural. Mechanical.

jensensbutton3y ago

Tech doesn't seem that great? Google demo'd Duplex in 2018 and it was so good at voice synthesis that people were arguing about whether or not it's ethical to not disclaim you're talking to AI.

freyr3y ago

It’s odd they label the voices as “soprano” and “baritone,” because they don’t sound like it.

I suspect it’s to avoid labeling the speakers as “male” and “female.” What a joke.

dmazin3y ago

Woah, the Madison voice is quite clearly Julia Whelan.

tibbydudeza3y ago

Lucasfilm licensed James Earl Jones's iconic voice in perpetuity when he retired from acting (come on, there is only ONE Darth Vader voice) - no doubt he and his estate in the future will get nice annual royalty cheques from Mr Mouse.

I wonder how this works with her.

thiht3y ago

Mitchell sounds like Alan Rickman, I felt like I was hearing Snape reading the sentence. I like it

jasonjmcghee3y ago

I heard it and am rather confident it's Ray Porter - it's uncanny. Instantly recognized it (have listened to a number of books narrated by him).

habosa3y ago

100%, I noticed it immediately.

mensetmanusman3y ago

It would be great if these voices were Siri options. The new Siri choices are quite bad…

mongol3y ago

What accent is used by the first voice? It creeps me out slightly, some kind of rz sounds...

dwighttk3y ago

Can I plug a standard ebook in and get digital narration? Or just Apple Books?

strictnein3y ago

Neither. The author of the book has to utilize these.

kmeisthax3y ago

So at first I thought Apple had managed to undo the whole nonsense that book publishers strong-armed Amazon into doing where they can turn off TTS narration to make you buy the audiobook. But instead this seems to just be "hey if you want to use TTS instead of a paid narrator, you can". Already kinda shitty, but there's extra shit cherries on top: the resulting recording cannot be used on other book platforms. Only Apple Books and the DRM nonsense that killed public libraries.

So it's also platform capitalist moat building, too - i.e. a scheme to deprive Amazon Audible of audiobooks. The more publishers opt to use Apple Books digital narration instead of paying a narrator, the fewer audiobooks will be available on Audible. And yes, you are allowed to still pay a narrator and distribute that recording on Audible, but... if you could do that, then obviously you wouldn't bother with Apple's TTS system.

Of course, the flipside of this is that Amazon refuses to bother with copyright enforcement for books not on Audible. Cory Doctorow found this out the hard way[0]. If you do not license your work to Amazon, Amazon will pay someone else to copy it, and for some reason DMCA 512 protects them[1]. So I can see this winding up being a functionally unused service anyway.

[0] https://www.audible.com/pd/Why-None-of-My-Books-Are-Availabl...

[1] To be clear, I do not oppose DMCA 512; I just don't think DRM-bearing audiobook services that charge money should be allowed to disclaim copyright liability. DMCA 512 and 1201 should be mutually exclusive.

csande173y ago

The "Helena" sample contains a pretty good test case of this system's ability to guess where emphasis and pauses go:

> I know Bill carried within him deep currents of spiritual yearning that he found easiest to express through the beauty he saw in all places wild.

(I checked in an online sample of the ebook: there is no punctuation in this sentence.)

I wonder if any of these AI speech synthesis tools come with an editing tool that you could use to tell it not to put the pause there.

cprecioso3y ago

I guess this is more of a case of garbage in - garbage out. The original sentence itself is not well-structured. But people who write like that and don't edit it; won't care of how the AI reads it.

[1]: I didn't know this audience existed until I saw this video on it (and the cons that happen) from Dan Olson: https://www.youtube.com/watch?v=biYciU1uiUw

leokennis3y ago

But that's the thing...is "perfection" worth 3 hours of video editing for something you casually consume?

I think almost any audiobook listener will vastly prefer a serviceable but imperfect audiobook when compared with no audiobook at all.

klondike_klive3y ago

Wowfunhappy3y ago

> But that's the thing...is "perfection" worth 3 hours of video editing for something you casually consume?

Any book of any length took countless hours to write and edit. Yes, I think it's worth a bit of extra time for a human to go through and read the thing aloud.

paulryanrogers3y ago

ClassyJacket3y ago

I can understand that sentence fine, I don't see anything wrong with it.

dbspin3y ago

Writer here, it's a terrible sentence.

"I know Bill carried within him deep currents of spiritual yearning that he found easiest to express through the beauty he saw in all places wild."

"Deep currents of spiritual yearning" is both cliched and unspecific / unclear.

"That" is superfluous.

It's unclear how Bill is expressing the beauty he sees, and the sentence structure implies he's somehow responsible for the natural beauty.

"All places wild" is wilfully awkward and anachronistic.

There are countless better ways to say the same thing. For example the tone would be similar and the sentence more concise just to say:

'When Bill spoke of the beauty of nature, I could sense it inspired spiritual feelings in him.'

Or simply: Natural beauty inspired in Bill a yearning for connection to something spiritual.

apocalypstyx3y ago

>But people who write like that and don't edit it; won't care of how the AI reads it.

We have built god, and god is stupid, and we bow before him because god has been created, once again, in the image of man.

andrepd3y ago

"Our algorithm isn't stupid, it's the literature which is wrong" is not something I expected to read today.

AstixAndBelix3y ago

Have you ever read a written passage out loud and failed because the text was too contrived and the nature of the intonation only became apparent after re-reading the sentence multiple times?

I have, plenty of times. And in most of those times I got really angry at the author for writing "wrong".

So yes, to me saying that the literature is wrong is nothing unexpected.

cprecioso3y ago

It’s more like “the algorithm is stupid, so it can’t fix bad literature in its brain like humans do”

float43y ago

I disagree for 2 reasons:

1. There's a perfectly fine reason to put a pause between "places" and "wild": to put emphasis on "wild". Bill doesn't see beauty in all places, but specifically in all wild places.

2. Interpreting the narration as "[...] all places. Wild!" is farfetched because the narrator pronounces "wild" very calmly and softly.

I agree the pause is a bit too long, but I was expecting way worse when I read your comment about how "the AI completely faceplants".

the_other3y ago

I disagree with both reasons.

Maybe the AI's smarter than me.

float43y ago

> A long pause between "places" and "wild", to me, signals their dis-association, that "wild" does not link with "places"

But hey, the fact that at least two people interpreted it differently says something. Maybe this was more of a faceplant than I initially realized.

[0] https://imgur.com/a/j9CtZFZ

csande173y ago

Maybe this is just me, but I find it really unnatural to pause in the middle of a short phrase like "all things wild". I'd emphasize "wild" by putting stress on it, not by pausing.

m_eiman3y ago

I'm guessing that the main reason they require you to go through their "preferred partners" is that their job is to insert annotations that the speech generator needs to make it sound good.

csande173y ago

The sample is taken from a full audiobook that is currently available for sale, so you'd think they would've put an annotation on "all places wild" if they had that ability.

adolph3y ago

Perhaps this bug could be a feature?

  A panda walks into a café. He orders a sandwich, eats it, then draws a gun and fires two shots in the air.

  "Why?" asks the confused waiter, as the panda makes towards the exit. The panda produces a badly punctuated wildlife manual and tosses it over his shoulder.

  "I'm a panda," he says at the door. "Look it up."

  The waiter turns to the relevant entry in the manual and, sure enough, finds an explanation.

  "Panda. Large black-and-white bear-like mammal, native to China. Eats, shoots & leaves."

https://en.wikipedia.org/wiki/Eats,_Shoots_&_Leaves

twobitshifter3y ago

csande173y ago

I've seen this construction used in a lot of places (like the name of "All Things Digital", the predecessor to Re/code), but I have never heard of anyone putting a dash between the last two words.

raverbashing3y ago

And might I add, Helena does not sound like a Soprano given the speech tone. (Also, smoking is bad mmmmkay)

jkmcf3y ago

A number of books I've been reading aloud leave out the comma after a prepositional phrase (and friends), and it totally throws off my cadence.

bongobingo13y ago

Will we see this level of voice synthesis in the public domain? Maybe I am out of touch but I found those examples very impressive - more impressive than the jobs vs rogan demo a few months back.

But I am also saddened at a future where all this is locked up in corporate hands - obviously there is money needed and (licensed) data needed too which Apple can get at.

criddell3y ago

> saddened at a future where all this is locked up in corporate hands

abraxas3y ago

Linux is a prime example of a technology that was at first far behind its proprietary counterparts but eventually dominated and nearly extinguished all non-free competitors.

FinalBriefing3y ago

For very specific uses. Linux is not a good option for general computers used by everyday people. That experience is still owned by large corporations.

wahnfrieden3y ago

rickdeckard3y ago

patentatt3y ago

cianmm3y ago

Audiobook narration is one of those things that is remarkably labour intensive, certainly much more than I'd have guessed before getting into it.

hidelooktropic3y ago

r00fus3y ago

As an audible customer I can tell you clearly that I often buy audiobooks by a particular narrator (discovering new authors) because the narrator is so good.

I mean, Siri is quite good at reading texts (I imagine that's a huge training corpus) but I think we'll be in "uncanny valley" for quite a few years.

It's possible that the public just gets used to that.

nmfisher3y ago

enlyth3y ago

This is already easily achievable on consumer hardware, you can train something like Tacotron 2 + WaveRNN on your own computer to achieve similar, if not better results. Check out this repo:

https://github.com/coqui-ai/TTS

You can also clone someone's voice by finetuning a pretrained LJSpeech model and training a vocoder from scratch, I've had great success with as little as 15 minutes of speech.

rockemsockem3y ago

EDIT: "fine voices" not "find voices"

GordonS3y ago

> You can also clone someone's voice by finetuning a pretrained LJSpeech model and training a vocoder from scratch, I've had great success with as little as 15 minutes of speech.

Are you able to point to any articles to help get started with this please?

enlyth3y ago

Unfortunately, I'm not aware of any beginner friendly tutorials.

davidzweig3y ago

Check my other comment in this thread, you might try our dataset. :)

enlyth3y ago

Will check it out, thanks

pmontra3y ago

Not from the companies selling audiobooks IMHO.

raffraffraff3y ago

kaba03y ago

There is https://commonvoice.mozilla.org/en though I’m not sure where and how it is being used.

jonathankoren3y ago

Common Voice is more about building a dataset of how people talk, especially with accents.

While there is/was a voice synthesis project at Mozilla, it was rudimentary about 3 years ago.

carb3y ago

You can play with it on https://uberduck.ai/ and they have a very active Discord!

idle_zealot3y ago

What exactly is "Open Source" about uberduck? It looks like a proprietary tts saas to me; no links to a git repo and the "developer" section just shows how to get an API key and hit their service.

andy_ppp3y ago

"Once your request is submitted, it takes one to two months to process the book and conduct quality checks."

My guess is that these generated voices are far from perfect and someone has to go in and crank the algorithm to get a fair number of passages to not sound strange.

qwerty4561273y ago

This thing by Apple already sounds way better than the best I heard previously (NextUp Ivona) but it is not an instant-result offline tool yet and that's sad.

potatolicious3y ago

It's an extremely hard problem that lots of people are working on.

lilyball3y ago

I wonder how effective it would be to feed the book to some other AI model first that reads the whole thing and figures out the necessary context that it can then go back and feed into the TTS model

davidzweig3y ago

I wanted to make a human-like reading feature for our language-learning software. Training a model isn't too hard using something like https://github.com/coqui-ai/TTS.

Last year we decided to make our own dataset with an Irish woman, Jenny. She has a soft Irish lilt.

Never got around to training the model, but I will upload the raw audio and prompts here in a few hours (need to pay my internet bill in town..):

https://github.com/dioco-group/jenny-tts-dataset/blob/main/R...

davidzweig3y ago

Added a download link to the readme: https://github.com/dioco-group/jenny-tts-dataset/blob/main/R...

lxgr3y ago

Are visual generative models really that much more advanced, or could this simply be an artifact of their usage?

Try e.g. illustrating a book, but using only your first prompt for each image. I think the quality would be in the same ballpark as having Siri narrate the corresponding audiobook.

dagmx3y ago

You’re describing the effects of familiarity with a subject.

ChatGPT and Copilot are similar. The answers seem confident, but the more familiar you are with the domain of the answer, the quicker you see how flawed the results are.

This is essentially the uncanny valley effect but for other areas.

dahfizz3y ago

Chat-GPT and StableDiffusion aren't perfect. They still produce weird responses or visual artifacts sometimes. But, it can be easy to move past these idiosyncrasies.

I think the brain is just more sensitive to speech, because inflection and tone is a key part of communication. So even subtle artifacts in the generated voice are really obvious and annoying.

Plus, as another commenter mentioned, books are long. An issue in 1 out of 10,000 words will be enough to break immersion.

ghaff3y ago

I don't find it easy to look past their idiosyncrasies at all although they can produce impressive results with fiddling and luck.

JustSomeNobody3y ago

qwerty4561273y ago

justincormack3y ago

The pictures have weird limbs and the writing has errors. A book is long therefore there will be a lot of issues.

andy_ppp3y ago

dmitriid3y ago

> Why is that we still can't have a perfect or near-perfect text-to-speech

Define perfect ;) Two different people will read the same text slightly (or not slightly) differently.

michaelmior3y ago

> is it harder than generating beautiful pictures and sophisticated writings

As far as writing goes, I suspect that the rules of written language are easier to learn and violations easier to overlook than with generated audio.

romeros3y ago

murf dot ai has near perfect tts. I think we had a major AI breakthrough in the last couple of years

BuckyBeaver3y ago

"Digitally narrated titles are a valuable complement to professionally narrated audiobooks"

Yeah, right. What a lame attempt to deflect the (fully warranted) criticism that this will put audiobook narrators out of work.

rickdeckard3y ago

I think the cat is already out of the bag, and the death of human narration is imminent.

Overall, I agree it's inevitable that this results in a sharp decline in professionally narrated Audiobooks...

amelius3y ago

> Overall, I agree it's inevitable that this results in a sharp decline in professionally narrated Audiobooks...

Or, it will increase demand for audiobooks so much that more humans are needed to create top-notch audio.

rickdeckard3y ago

I don't know how badly narrated Audiobooks can increase the demand for Audiobooks as a whole.

rdevsrex3y ago

One of the things I like is when the narrator has to suppress a laugh during a funny passage, or can express a character's anger or frustration.

Until AI is so good that it can mimic emotion, I think there will be a market for human narrators. Of course it will be smaller than what it is now, but I think people will specialize.

rickdeckard3y ago

> One of the things I like is when the narrator has to suppress a laugh during a funny passage, or can express a character's anger or frustration.

> Until AI is so good that it can mimic emotion, I think there will be a market for human narrators. Of course it will be smaller than what it is now, but I think people will specialize.

BuckyBeaver3y ago

They won't be able to. Not enough people will reject computer-generated narration and insist on real narrators for them to remain employed.

prepend3y ago

Sad for the narrators, but good for the world.

There are so many books I have that don’t have an audiobook version because the economics just aren’t there.

This is an easy way that technology can expand human experience.

Even in situations where the author reads the book, I expect it will be cheaper to train an AI to sound like the author than to put the author in a studio for 50 hours (or whatever).

I thought it was a really dumb ruling when Amazon was forced to remove the text to speech function from kindle.

If I’ve bought a book, I should be able to use it as I like.

elil173y ago

The current iteration of this technology is not competing with truly great narrators, like Tom Hanks or Jim Dale.

tinus_hn3y ago

This is as valid a criticism as complaining the phonograph will put musicians out of work. Time marches on, adapt.

acdha3y ago

abraxas3y ago

An AI TTS engine at this level would do a far better job of it than that particular dude.

339559853y ago

karmasimida3y ago

Can they just train a model to narrate for themselves and fix the sections where the model makes mistakes?

TBH, a human narrator on Audible sometimes just reads the stuff aloud.

339559853y ago

drexlspivey3y ago

Voices have a race?

339559853y ago

https://www.consumerreports.org/digital-assistants/apples-ne...

vortegne3y ago

Often there are clear voice and accent differences. Are you saying there aren't? That would be a very strange statement.

mhuffman3y ago

Not so strange! ... at least when it comes to the legal community[1] (pdf).

For those around during the O.J. Simpson trial, this was a very, very contentious topic during the trial!

[1]https://cpb-us-w2.wpmucdn.com/sites.wustl.edu/dist/3/2151/fi...

drexlspivey3y ago

Accent depends on where you were born/raised, not on your race.

tchalla3y ago

Are you saying that voices across all races, cultures etc sound the same?

drexlspivey3y ago

No, it depends on the culture. Both whites and blacks raised in East London will have similar accents for example.

Since we are strawmanning, are you saying that cultures and races have a 1-to-1 relationship?

troupe3y ago

I seriously doubt you can tell someone's genetics (their race) based on their voice. On the other hand, you can tell a huge amount about the culture in which they were raised.

theshrike793y ago

The OG Kindle with keyboard used to have text to speech too.

It was killed by publishers who wanted to charge separately for audiobooks.

If Apple has somehow managed to get the licensing for this, I might consider buying from Apple Books in the future.

rickdeckard3y ago

The product is actually directed at authors, offering them to have an Audiobook produced which is "digitally narrated by Apple Books".

The Author still needs to hold the rights for Audiobook production, and he needs to license a third party to produce an Audiobook (no matter if human or "digitally" narrated).

I guess that's why this is aimed at "independent Authors", to circumvent negotiating Apple's rev.share and exclusivity for that production with established publishers...

shrx3y ago

> It was killed by publishers who wanted to charge separately for audiobooks.

Any sources on this?

bentley3y ago

Amazon decides Kindle speech isn’t worth copyright fight (2009) — https://arstechnica.com/gadgets/2009/03/amazon-backs-off-on-...

See also the recent lawsuit covering the other direction, automatic transcription of Audible books. https://www.geekwire.com/2020/amazon-owned-audible-major-pub...

shrx3y ago

Thanks. I never really used Kindle Speech on my 3G Kindle, but was curious why it was suddenly gone in later versions.

carlob3y ago

You are never buying from Apple Books, it's the usual DRMed crap, more like a rental. Amazon had gotten a lot of flak, but Apple is not any better in this respect...

ezfe3y ago

If DRM works offline, it's not a rental. It's not desirable, but don't call it a rental, that just moves focus away from what the real problems are here.

TheCoelacanth3y ago

Even if it works offline, it probably won't continue to work if you need to switch to a new device after their DRM servers are turned off.

DRMed content can never truly be purchased.

criddell3y ago

2Gkashmiri3y ago

licensing as in?

theshrike793y ago

License/copyright for written form of book is different than the read form.

Licenses are fun!

alanwreath3y ago

cjensen3y ago

alanwreath3y ago

I totally see this point — I’m making a separate one for ebooks that aren’t getting purchased because they haven’t been (and probably never will be) narrated by a real human.

cjensen3y ago

I agree with you. Apple making this an author choice avoids some authors being angry while enabling more sales for lower-volume books that, as you point out, will otherwise not have a spoken version.

napier3y ago

layer83y ago

This should be a feature available for any text document. The existing iOS text-to-speech is almost barely adequate, but not really.

Elof3y ago

I use the iOS Speech Accessibility feature to listen to ebooks and it works great.

alanwreath3y ago

that is a good feature, it just seemed like the reading I'm hearing off of the samples for these audiobooks is a tad less robotic.

ghaff3y ago

I'm skeptical given the state of the art.

falcolas3y ago

A couple of comments from a narrator who's worked through ACX before.

If you'd like to look at how little it can cost to get a human to do voiceover work, check out fiverr.com and look for voice actors and narrators.

ghaff3y ago

Thanks for the insights.

That doesn't really surprise me. On the flip side, I can get high quality transcriptions for $1/minute (given good audio quality).

sigmar3y ago

rockemsockem3y ago

troupe3y ago

(But totally agree with you that this isn't going to replace a good human narrator.)

rickdeckard3y ago

Indeed. But this option will only be available if a critical mass is also willing to pay a few dollars more for human narration.

ghaff3y ago

rickdeckard3y ago

thomasahle3y ago

I have a bunch of books on my "to read" list, that still don't have a narration. I would happily listen to an AI version as an alternative.

ivansavz3y ago

Or you could use the command `say` on the command line on any current mac to get good-enough text-to-speech.

Here is a sample if you want to hear what it sounds like: https://minireference.com/static/tmp/constructive_feedback.m...

which is the audio from this blog post https://productivityhub.org/2019/04/19/how-to-deliver-constr...
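For anyone who wants to try it, here is a minimal sketch of wrapping `say` in a small shell function. It assumes macOS, where `say` ships with the system; the helper name `speak_file` and the file names in the usage note are made up for illustration. The `-f` flag reads the text from a file and `-o` writes the audio to a file instead of playing it.

```bash
#!/usr/bin/env bash
# Sketch only: assumes macOS, where `say` is preinstalled.
# speak_file is a hypothetical helper name, not part of any tool.
speak_file() {
  local input="${1:-}" output="${2:-}"
  # Refuse to continue if the text file is missing or unreadable.
  if [ ! -r "$input" ]; then
    echo "cannot read input file: $input" >&2
    return 1
  fi
  # `say` only exists on macOS, so fail cleanly elsewhere.
  if ! command -v say >/dev/null 2>&1; then
    echo "say not found (it ships with macOS only)" >&2
    return 1
  fi
  if [ -n "$output" ]; then
    say -f "$input" -o "$output"   # save the speech to an audio file
  else
    say -f "$input"                # speak through the default audio output
  fi
}
```

Usage would look like `speak_file post.txt` to listen right away, or `speak_file post.txt post.aiff` to keep the audio for later.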

scinerio3y ago

ivansavz3y ago

I've experienced that too, but only for "bad writing."

I'm normally able to follow narrative (both fiction and non-fiction) that has something to teach, and I also enjoy listening to classic literature with no problem...

Perhaps the monotone TTS can be a test for the "meaning" contents of a text.

rockemsockem3y ago

If you're still interested, give graphic audio a try. They're full-cast (usually a different reader for each character) high production quality audiobooks. They cost accordingly too though.

https://www.graphicaudio.net/

hiidrew3y ago

TIL. This is an interesting capability of the command line. Have any more fun ones? (at least fun to a CL noob)

ivansavz3y ago

Here is another script, `getmp3.sh`, that you can use to download an .mp3 file from any YouTube music video:

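   #!/usr/bin/env bash
   # Usage: getmp3.sh <video-url>; requires yt-dlp (and ffmpeg for the mp3 conversion).
   set -euo pipefail
   echo "Downloading mp3 from ${1:?usage: getmp3.sh <video-url>}"
   yt-dlp -x --audio-format mp3 "$1"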

macintux3y ago

I haven't figured out how to effectively search my HN favorites, else I'd probably be able to find a few more of these, but this was discussed recently:

https://git.herrbischoff.com/awesome-macos-command-line/abou...

nstart3y ago

aeneasmackenzie3y ago

walterbell3y ago

Good to see mainstream accessibility work on high-quality text to speech.

TacWitch3y ago

Mitchell sounds exactly like Ray Porter. I wonder if he trained a model with them or they did it without his direct approval.

Maursault3y ago

> Mitchell sounds exactly like Ray Porter.

[1] https://www.youtube.com/watch?v=d8UoL0AOL3k&t=1m49s

jjcm3y ago

I reached out to him on Twitter [0] asking exactly this.

According to a Reddit comment [1] it is, but they haven’t posted their source.

[0] https://twitter.com/pwnies/status/1610857711008370688?s=46&t...

[1] https://reddit.com/r/apple/comments/103iogu/_/j305eby/?conte...

CubsFan10603y ago

I don’t think you got the correct Ray Porter.

Correct one is https://twitter.com/Ray__Porter

jjcm3y ago

Doh. Thank you - didn't realize it was the double underscore.

tartuffe783y ago

Yea I immediately recognized him, hope he's getting paid for this.

urbandw311er3y ago

He likely won’t be able to comment without breaching an NDA — my recollection is that the guy who voiced the original Siri got in all sorts of trouble for trying to capitalise on it.

SurgeArrest3y ago

habosa3y ago

southp4w3y ago

The Mitchell voice also sounds almost exactly like Ray Porter, another real audiobook narrator.

fuzzywalrus3y ago

I heard that and it was uncanny how much it sounded like Ray Porter to me.

urbandw311er3y ago

If she did she’ll likely be NDA’d up to her neck to prevent her ever admitting it publicly.

drexlspivey3y ago

mgh23y ago

Samples from https://news.ycombinator.com/item?id=34253424

Male: https://books.apple.com/gb/audiobook/pale-moon-rising/id1640...

Female: https://books.apple.com/gb/audiobook/shelter-from-the-storm/...

jasonjmcghee3y ago

The male sample has many pauses that are distractingly long. Pretty interesting.

mark_l_watson3y ago

While TTS has broad application, I am skeptical about Apple’s process being able to compete with the best narrators.

I have mostly worked on deep learning projects over the last eight years, so I appreciate the tech as an engineering tool, but I think it is important to view tech as a servant to human experience.

RegnisGnaw3y ago

It's like a Lexus vs. a Hyundai.

macintux3y ago

Best narrators? Agreed. But as someone who did some recording for a local radio station years ago, it’s an incredibly time-intensive project to record a book.

mensetmanusman3y ago

Can it pronounce “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo”?

https://en.m.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buff...

banana_giraffe3y ago

The biggest failing for me:

visitect3y ago

jdmoreira3y ago

In 2019 I did an "Ask HN: When will text-to-speech replace narrators".

Most answers did not age well I would say.

https://news.ycombinator.com/item?id=20931541

phphphphp3y ago

Also as far as predictions go, I predict that a few years from now there’ll be research and think pieces about the impact learning from artificial voices has on young children.

lern_too_spel3y ago

If this is your standard, the answer was 2020 or maybe earlier. https://news.ycombinator.com/item?id=34271329

I don't think publishers will stop hiring professional narrators for the bestsellers (instead of the long tail that has been text-only until recently) for a few years more.

rahoulb3y ago

I have to say that a number of audiobooks I've bought recently have been totally spoilt by the narration - and the AI voices on offer here all sound better than the humans involved in those books.

Dig1t3y ago

rahoulb3y ago

I've got them on a set of CDs somewhere, but have nothing that will play them...

dirtyid3y ago

mcphage3y ago

> I guess this was what they were actually aiming for.

I would imagine it was the other direction—that this is a way for them to test out their improvements to Siri.

kumarvvr3y ago

Is it just a coincidence that only a few days back I saw an article or video about how low a payment Audible makes to its authors?

It was a famous author too.

And now this announcement.

Ruthalas3y ago

Likely referring to Brandon Sanderson's recent comments[0].

Edit: Previous discussion on HN: [1]

[0] https://winteriscoming.net/2022/12/30/brandon-sanderson-blas...

[1] https://news.ycombinator.com/item?id=34104204

dpratt3y ago

“Mitchell” is clearly Ray Porter, an absolutely phenomenal voice actor/narrator. He’s done a range of audiobooks across many genres, and anything he does is a pleasure to listen to.

southp4w3y ago

Just replied this to someone else. Instantly recognizable as Ray! Phenomenal narrator. He and Nick Podehl are my favorites

flakiness3y ago

I hope Amazon does this too, but they probably won't because it'll cannibalize Audible. This is a good move from Apple. Take the credit!

WalterBright3y ago

The first versions of Amazon's Kindle did this. Then they got mired in lawsuits over it from the book industry.

gdcbe3y ago

Also, having the progress sync between digital text book and audio version is a great UX improvement!

potatolicious3y ago

> Why the law suits.

Because the contracts between Amazon and the publishers do not permit this kind of work.

rockemsockem3y ago

gdcbe3y ago

hintymad3y ago

I had the same question, but Apple seems to offer the narration to authors so they can choose whether a book comes with text to speech.

thedailymail3y ago

londons_explore3y ago

I'm very surprised they make this a feature for authors, rather than a feature for users.

As a user-focussed feature, it could read any ebook out loud, and would differentiate Apple Books from any other audiobook platform.

I guess it's aimed at authors, because then they can charge the author for the 'narration' service....

dmitriid3y ago

> I'm very surprised they make this a feature for authors, rather than a feature for users.

Licensing. Audiobooks are a different license from ebooks, and trying to narrate an ebook will infringe the licensing terms.

londons_explore3y ago

Apple is big enough they could just tell authors "here is our new narration feature for users. If you don't like it, pull your books off our platform.".

No author is going to win a Twitter flame war because they don't want a 'speak it out loud' button provided by Apple on their ebooks.

Besides - all ebooks on Apple platforms already support this via:

Settings → General → Accessibility → VoiceOver → ON turns on the VoiceOver feature

This is just a higher quality version of the same.

hnbad3y ago

Your view is a bit myopic as you seem to assume Apple can just throw its weight at it and that winning in the US market would be the same as winning globally.

This is like suggesting Spotify just ignore the RIAA or Netflix should just stream all content in all countries, licensing restrictions be damned.

dmitriid3y ago

> Apple is big enough they could just tell authors "here is our new narration feature for users. If you don't like it, pull your books off our platform.".

You assume that it is authors who sell books on Apple platform (or on any platform).

Let me introduce you to a couple of chunky boys:

- Penguin Random House https://en.wikipedia.org/wiki/Penguin_Random_House

- HarperCollins https://en.wikipedia.org/wiki/HarperCollins

- Simon & Schuster https://en.wikipedia.org/wiki/Simon_%26_Schuster

Veen3y ago

It’s not big enough to tell the major publishers that. Especially if it wants mainstream titles on its relatively small ebook platform.

339559853y ago

Apple is big enough to have the EU dictate what port to ship their phones with, too. The regulatory landscape requires careful navigating.

rockemsockem3y ago

dmitriid3y ago

> I don't think that Apple providing a high quality narration tool with their phone which can be used on any ebook would infringe licensing terms.

Oh, it definitely would. This produces a derivative work besides anything else.

> It's not like they'd be saying "Here is this specific title for you to read/buy/rent", they'd be releasing a tool with the power to do that.

That's exactly what they will be doing from the point of view of copyright law.


Hakeemmidan3y ago

I regularly use my MacBook for narration. I look forward to this being better-adapted for books.

lvl1023y ago

rickdeckard3y ago

karmasimida3y ago

TBH it will depend on how those narrations turn out to be.

Cherry-picking is easy, but I paid for this, it needs to be human quality throughout

rickdeckard3y ago

> Cherry-picking is easy, but I paid for this, it needs to be human quality throughout

kzrdude3y ago

Yes, it needs to be tens of hours of perfectly good narration.

sidibe3y ago

lvl1023y ago

I think you’re underestimating the state of the art in this area. You can do amazing things with just a few minutes of readings.

rockemsockem3y ago


hxugufjfjf3y ago

I cannot put into words how much I want Silmarillion read by Scarlett Johansson.

abudabi1233y ago

I want on Spotify early faery tales and parables of dragon-slayers and dragons as read aloud by Joe Rogan, Michael Bisping, Tom Aspinal to kids of the lower political-economy.

barelysapient3y ago

It’s the spaces between sentences and between pauses that need work. Usually a reader will take a breath or finish exhaling. Instead Apple’s audio drops to 0 db. It sounds unnatural. Mechanical.

jensensbutton3y ago

Tech doesn't seem that great? Google demo'd Duplex in 2018 and it was so good at voice synthesis that people were arguing about whether or not it's ethical to not disclaim you're talking to AI.

freyr3y ago

It’s odd they label the voices as “soprano” and “baritone,” because they don’t sound like it.

I suspect it’s to avoid labeling the speakers as “male” and “female.” What a joke.

dmazin3y ago

Woah, the Madison voice is quite clearly Julia Whelan.

tibbydudeza3y ago

I wonder how this works with her ???.

thiht3y ago

Mitchell sounds like Alan Rickman, I felt like I was hearing Snape reading the sentence. I like it

jasonjmcghee3y ago

I heard it and I'm rather confident it's Ray Porter; it's uncanny. Instantly recognized it (have listened to a number of books narrated by him).

habosa3y ago

100%, I noticed it immediately.

mensetmanusman3y ago

It would be great if these voices were Siri options. The new Siri choices are quite bad…

mongol3y ago

What accent is used by the first voice? It creeps me out slightly, some kind of rz sounds...

dwighttk3y ago

Can I plug a standard ebook in and get digital narration? Or just Apple Books?

strictnein3y ago

Neither. The author of the book has to utilize these.

kmeisthax3y ago

[0] https://www.audible.com/pd/Why-None-of-My-Books-Are-Availabl...
