I have to imagine listening to raw papers (without even someone like Andrej Karpathy interpreting and presenting them) would be even more difficult. I don't know if there's an easy way to passively consume academic literature at all. If it's important stuff, it will usually be pretty challenging.
Of course everyone will immediately say this is dangerous and it may mislead you by giving wrong explanations, etc., and then others will counter with "it will definitely get better over time" (the best models as products are ~3 years behind the improvements being shown in academic work, for example). However, ultimately this is just a neat product to make, even if it has some bugs. Listening via plain TTS right now means spending about half the time hearing jumbled numbers from tables and recitations of author names. So tackling that alone (which this would do much better) would be valuable.
There will always be ways to misinterpret some academic work, and the path to understanding a work offers plenty of opportunities to do so.
Allowing someone to engage with a work _at all_ by lifting some barriers (for visually impaired people, for example) should be acknowledged as an improvement, not continually discouraged for having some bugs.
> "You are an ArXiv paper audio paraphraser. Your primary goal is to rephrase the original paper content while preserving its overall meaning and structure, but simplifying along the way, and make it easier to understand. In the event that you encounter a mathematical expression, it is essential that you verbalize it in straightforward nonlatex terms, while remaining accurate, and in order to ensure that the reader can grasp the equation's meaning solely through your verbalization. Do not output any long latex expressions, summarize them in words."
The bit about translating LaTeX expressions into human-comprehensible math sentences is interesting and AFAIK should work on something like GPT-4. But that's just a case of technical translation. GPT-4 definitely cannot "rephrase the overall paper... simplifying along the way." GPT-4 can't even summarize corporate reports without screwing up facts and figures - why on earth would you try to use it to summarize new scientific research?
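Wiring that narrow translation step up is trivial, for what it's worth. A rough sketch with the OpenAI Python client (the model name is my assumption, and you'd still want to spot-check outputs against the source):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VERBALIZE_PROMPT = (
    "Rephrase the following LaTeX expression as a plain-English sentence "
    "a listener could follow without seeing the notation. Do not output "
    "any LaTeX."
)

def verbalize(latex_expr: str) -> str:
    """Turn one LaTeX expression into a spoken-word description."""
    resp = client.chat.completions.create(
        model="gpt-4",  # assumption; any capable chat model would do
        messages=[
            {"role": "system", "content": VERBALIZE_PROMPT},
            {"role": "user", "content": latex_expr},
        ],
    )
    return resp.choices[0].message.content

# e.g. verbalize(r"\hat{y} = \sigma(Wx + b)")
# -> "y-hat is the sigmoid of W times x plus b", or similar
```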
Stuff like this is why I'm so concerned about LLMs: this prompt doesn't work, and using AI for this stuff just automates ignorance. Very frustrating.
[1] I say "honest" because this prompt would probably do ok on stuff coming out of a paper mill - the problem is carefully stated original ideas. GPT tears original ideas to shreds.
Nowadays I often pass the PDF through an LLM to get a personalized version (expanding the jargon or trimming the verbiage) and then read that. It gives me a better return on time spent.
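Concretely, it's roughly this. A sketch, with pypdf and the OpenAI client as the assumed tools; a real paper would overflow the context window, so in practice you'd chunk by section:

```python
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()

def personalize(pdf_path: str, instruction: str) -> str:
    """Rewrite a paper's text per a personal instruction, e.g.
    'expand all jargon for a non-specialist' or 'cut the verbiage'.

    Sketch only: sends the whole paper at once, which long papers
    won't fit; chunk by section for real use.
    """
    text = "\n".join(
        page.extract_text() or "" for page in PdfReader(pdf_path).pages
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # assumption
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content
```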
https://www.youtube.com/@ArxivPapers
The pipeline seems to do a pretty good job of cleaning up the writing too; some ArXiv papers are a little rough.
(I'm not the project owner)
Haven't found the right way yet. I'm considering: https://github.com/MycroftAI/mimic3
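If mimic3 works out, driving it from a script is simple enough. A sketch that shells out to the mimic3 CLI (the voice name here is an assumption; check what's actually installed):

```python
import subprocess

def speak_to_file(text: str, wav_path: str) -> None:
    """Render text to a WAV file with the mimic3 CLI.

    Assumes mimic3 is installed and on PATH; it writes audio to
    stdout, so we redirect that into a file.
    """
    with open(wav_path, "wb") as out:
        subprocess.run(
            ["mimic3", "--voice", "en_US/vctk_low", text],  # voice is an assumption
            stdout=out,
            check=True,
        )

# speak_to_file("Section 3 introduces the attention mechanism.", "section3.wav")
```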
You can also have it read into an audio file if so desired, which can be listened to later.
[1] https://f-droid.org/en/packages/com.foobnix.pro.pdf.reader/
For example: you are listening to the paper with some text-to-speech model and it stumbles upon a code snippet or table or graph... what should happen next? Should the model skip it, or prompt you to look at the graph or table? Or should you write software that tries to interpret graphs and other non-text content?
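My guess is the cheap answer is good enough for a first version: detect the non-prose block and speak a short cue (plus the caption, if one is easy to find) instead of reading it raw. A sketch over LaTeX source; the cue wording and the environment list are my own assumptions:

```python
import re

# Environments that don't read well aloud, mapped to a spoken cue.
CUES = {
    "table": "There is a table here; see the paper for the numbers.",
    "figure": "There is a figure here; see the paper for the graphic.",
    "lstlisting": "There is a code listing here; see the paper for the code.",
}

def cue_nonprose(latex: str) -> str:
    """Replace each non-prose environment with a short spoken cue,
    keeping its caption when one is present."""
    for env, cue in CUES.items():
        def replace(match: re.Match, cue=cue) -> str:
            caption = re.search(r"\\caption\{(.*?)\}", match.group(0), re.DOTALL)
            if caption:
                return f"{cue} Its caption reads: {caption.group(1)}"
            return cue
        latex = re.sub(
            rf"\\begin\{{{env}\}}.*?\\end\{{{env}\}}",
            replace, latex, flags=re.DOTALL)
    return latex
```

Skipping entirely loses the caption, which is often the one part worth hearing, so the cue-plus-caption middle ground seems like the right default before reaching for anything that "interprets" graphs.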
I really do wish GitHub would prompt its repo owners with "did you forget a license?", but I also wish it would prompt them to add "topics" to enhance discovery, and I guess I'll just keep holding my breath on both.
Edit: looks like they support a few traditional publishers as well.