The PDF Parser offers the following features:
* Sections and subsections along with their levels
* Paragraphs - combines lines
* Links between sections and paragraphs
* Tables along with the section the tables are found in
* Lists and nested lists
* Joining content spread across pages
* Removal of repeating headers and footers
* Watermark removal
* OCR with bounding boxes
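To make the "Paragraphs - combines lines" item concrete, here's a minimal sketch of that kind of line-merging rule (my own illustration, not the library's actual implementation), in plain Python:

```python
def lines_to_paragraphs(lines):
    """Merge consecutive non-empty lines into paragraphs.

    Blank lines act as paragraph breaks. Real parsers also use layout
    cues (indentation, vertical gaps, font changes); this only shows
    the basic idea.
    """
    paragraphs, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

pdf_lines = [
    "Parsing PDFs can be quite the",
    "headache because the format",
    "is so complex.",
    "",
    "We support most of these features.",
]
print(lines_to_paragraphs(pdf_lines))
# → ['Parsing PDFs can be quite the headache because the format is so complex.',
#    'We support most of these features.']
```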
Here are a couple of examples:
- https://neuml.hashnode.dev/build-rag-pipelines-with-txtai
- https://neuml.hashnode.dev/extract-text-from-documents
Disclaimer: I'm the primary author of txtai (https://github.com/neuml/txtai).
Currently I'm using a mix of MuPDF + AWS Textract (mostly for tables), but I'd love to understand what other people are doing.
Parsing PDFs can be quite the headache because the format is so complex. We support most of these features already but there are always so many edge cases that additional angles can be very helpful.
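Those edge cases are why rule-based heuristics pile up. As one illustration of the "removal of repeating headers and footers" problem (a sketch of the general technique, not any particular library's code): candidate headers and footers can be found by counting how often a page's first or last line recurs across pages:

```python
from collections import Counter

def strip_repeating_lines(pages, min_ratio=0.8):
    """Drop first/last lines that repeat on most pages (likely headers/footers).

    `pages` is a list of pages, each a non-empty list of text lines.
    The 0.8 ratio is an arbitrary threshold you'd tune; real parsers
    also normalize page numbers so "Page 1"/"Page 2" match each other.
    """
    edges = Counter()
    for lines in pages:
        for line in {lines[0], lines[-1]}:  # set: don't double-count 1-line pages
            edges[line] += 1
    threshold = min_ratio * len(pages)
    repeated = {line for line, n in edges.items() if n >= threshold}
    return [[l for l in lines if l not in repeated] for lines in pages]

pages = [
    ["ACME Annual Report", "Revenue grew 10%.", "Page 1"],
    ["ACME Annual Report", "Costs fell 5%.", "Page 2"],
    ["ACME Annual Report", "Outlook is positive.", "Page 3"],
]
print(strip_repeating_lines(pages))
```

Here the header "ACME Annual Report" appears on all three pages and is stripped, while the distinct "Page N" footers survive — exactly the kind of edge case (unnormalized page numbers) that keeps these parsers growing.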
There are now a lot of file loaders for RAG (LangChain, LlamaIndex, unstructured, ...). Is there any reason, like a leading benchmark score, to prefer this one?
However, I have a PDF parsing use case that I tried those RAG tools for, and the output they give me is pretty low quality. It kinda works for RAG, since the LLM can work around the issues, but if you want higher-quality responses with proper references and such, I think the best way is to write your own rule-based parser, which is what I ended up doing (based on MuPDF though, not Tika).
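For what it's worth, the core of such a rule-based parser can be quite small. A hedged sketch (my own illustration, not the parser described above): given text spans with font sizes, in roughly the shape PyMuPDF's `page.get_text("dict")` produces, headings can be split from body text by comparing each span's size against the dominant size. The 1.2 ratio is an assumption you'd tune per corpus:

```python
from collections import Counter

def classify_spans(spans, ratio=1.2):
    """Label spans as 'heading' or 'body' by font size.

    Assumes spans shaped like {"text": ..., "size": ...}. The most
    common (rounded) size is taken as the body size; anything more
    than `ratio` times larger is treated as a heading.
    """
    body_size = Counter(round(s["size"]) for s in spans).most_common(1)[0][0]
    return [
        ("heading" if s["size"] > body_size * ratio else "body", s["text"])
        for s in spans
    ]

spans = [
    {"text": "1. Introduction", "size": 16.0},
    {"text": "PDF parsing is hard.", "size": 10.0},
    {"text": "It has many edge cases.", "size": 10.0},
]
print(classify_spans(spans))
# → [('heading', '1. Introduction'), ('body', 'PDF parsing is hard.'),
#    ('body', 'It has many edge cases.')]
```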
Maybe that's what the authors of this tool were thinking too.
None of the mentioned tools did this out of the box, none seemed easy to configure, and all are definitely hyped and marketed way beyond fitz though.
Overall, I believe there has to be some middle ground for identification and trust building over time, between a "hidden group with no names on $CORP secure site" and more traditional means of introduction.
Thanks for posting this interesting and relevant work.
Some examples are here with a notebook: https://github.com/nlmatics/llmsherpa
Here's another notebook from the repo with examples: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks...
You can use the llmsherpa library (https://github.com/nlmatics/llmsherpa) with this server to get nice, layout-friendly chunks for your LLM/RAG project.
What this library, and something like fitz/PyMuPDF, allows you to do is extract the text straight from the PDF, using rules about how to parse and structure it. (With most modern PDFs you can extract the text without OCR.)
It's much cheaper, obviously, but doesn't scale well across dynamic layouts, so you'll likely use it when you can configure around a standard structure. I have found rule-based text extraction to work fairly dynamically, though, for things like scientific PDFs.
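For scientific PDFs specifically, a lot of that mileage comes from exploiting their conventions. A small hypothetical example of such a rule (my own sketch): numbered headings like "2.1 Methods" can be picked out with a regex, with nesting depth read straight off the numbering:

```python
import re

# Matches headings such as "2 Methods" or "2.1 Data collection".
SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")

def parse_outline(lines):
    """Extract (level, number, title) tuples for numbered headings.

    Caveat: a bare regex will false-match lines like "10 patients were
    enrolled", so real parsers combine this with font-size or layout cues.
    """
    outline = []
    for line in lines:
        m = SECTION_RE.match(line.strip())
        if m:
            number, title = m.groups()
            outline.append((number.count(".") + 1, number, title))
    return outline

lines = ["1 Introduction", "Some body text.", "2 Methods", "2.1 Data collection"]
print(parse_outline(lines))
# → [(1, '1', 'Introduction'), (1, '2', 'Methods'), (2, '2.1', 'Data collection')]
```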
All this is automated in the llmsherpa parser https://github.com/nlmatics/llmsherpa which you can use as an API over this library.
I fear Tesseract OCR is a potential limitation, though. I've seen it make so many mistakes.
Looks like Apache 2 license which is nice.