In other words, take an episode of The Daily and have one language model write a hypothetical article summarizing what the podcast was about. Then pass that article into the two-speaker model, transcribe the output, and see how well the transcript aligns with the article fed in as input.
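A minimal sketch of that round-trip test, assuming a hypothetical generate_podcast() wrapper around the two-speaker model; transcription here uses openai-whisper, and the alignment score is just a difflib ratio rather than any principled metric:

```python
import difflib

import whisper  # pip install openai-whisper


def generate_podcast(article: str) -> str:
    """Hypothetical stand-in for the two-speaker audio model.

    Returns the path to the generated audio file.
    """
    raise NotImplementedError("replace with the actual model call")


article = open("hypothetical_daily_article.txt").read()
audio_path = generate_podcast(article)

# Round-trip: transcribe the generated audio back to text.
model = whisper.load_model("base")
transcript = model.transcribe(audio_path)["text"]

# Crude alignment score: character-level similarity between the
# input article and the round-tripped transcript.
score = difflib.SequenceMatcher(None, article.lower(), transcript.lower()).ratio()
print(f"round-trip similarity: {score:.2f}")
```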
I am sure I’m missing essential details, but the natural sound of these podcasts cannot possibly be coming from a text transcript.
https://google-research.github.io/seanet/soundstorm/examples...
I have good and bad news for you - they did not! We were the first podcast to interview the audio engineer who led the audio model:
https://www.latent.space/p/notebooklm
TL;DR: they did confirm that the transcript and the audio are generated separately, but yes, the TTS model is trained far beyond anything we have in OSS or commercially available.
I'm actually not sure what to make of that, but it's interesting to note.
One cheap trick to overcome this uncanny valley may be to use two separate LLMs, or two separate contexts/channels, to generate the conversation, with each taking "turns" producing follow-up responses and even interruptions where warranted.
Might mimic a human conversation more closely.
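A minimal sketch of that two-context approach, assuming the openai Python client; the system prompts, model name, and turn count are all placeholder choices:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()

# Each "host" keeps its own message history, so the two never share context.
hosts = [
    [{"role": "system", "content": "You are Host A, an upbeat podcast host."}],
    [{"role": "system", "content": "You are Host B, a skeptical co-host who sometimes cuts in mid-thought."}],
]

last_line = "Kick off a short podcast about open-source TTS models."
for turn in range(6):  # arbitrary turn count
    speaker = turn % 2
    # The current speaker sees the other host's last line as user input.
    hosts[speaker].append({"role": "user", "content": last_line})
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=hosts[speaker],
    ).choices[0].message.content
    hosts[speaker].append({"role": "assistant", "content": reply})
    print(f"Host {'AB'[speaker]}: {reply}\n")
    last_line = reply
```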
(And given there is no LICENSE file, I'm afraid you can only use this code as a reference at best right now.)
https://github.com/meta-llama/llama-models/blob/main/models/...
(which refers to the Meta Llama 3.2 license)
If the intention was to make something that you can only use with Llama models, stating that clearly in a separate code license file would be better IMO. (Of course, this would also mean that the code still isn’t open source.)
NotebookLM, far and away, has been the "AI Killer App" for the VAST MAJORITY of bright-but-not-particularly-techy people I know. My 70ish parents and my 8-year-old kid are both just blown away by this thing and can't stop playing with it.
Edit: As someone pointed out below, I absolutely mean just the "podcast" thing.
I don't really see MYSELF being into it, but it just seems to WOW the hell out of a lot of people.
Again, I'm absolutely like you and I'm with you. I don't much do podcasts either, but in a way this is why I worded it like this. It struck me as a fun party trick to ignore, but it really seems to GRAB a lot of other people.
Open Source TTS models are slowly catching up, but they still need beefy hardware (e.g. https://github.com/SWivid/F5-TTS)
"Speech Model experimentation: The TTS model is the limitation of how natural this will sound. This probably be improved with a better pipeline and with the help of someone more knowledgable-PRs are welcome! :)"
What this does demonstrate, however, is that prototyping with LLMs is very fast. I'd encourage anyone who hasn't played around with the APIs to give it a go.
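For example, a first cut at the transcript-generation step is only a few lines, again assuming the openai client; the model name, prompt, and input file are just illustrative:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": "Turn this text into a short two-host podcast script, "
                   "with natural interjections:\n\n"
                   + open("source_document.txt").read(),
    }],
)
print(resp.choices[0].message.content)
```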
Disagreed. NLM is novel in how the two hosts interrupt and overlap each other. No other OSS solution does that; they just take turns talking.
Here is a demo video: https://youtu.be/zVX-SqRfFPA
I am more interested in the other features of NotebookLM. The podcasts are fun but gimmicky.
"Please use the original title, unless it is misleading or linkbait; don't editorialize." - https://news.ycombinator.com/newsguidelines.html