The PDF Parser offers the following features:
* Sections and subsections along with their levels
* Paragraphs - combines lines
* Links between sections and paragraphs
* Tables along with the section the tables are found in
* Lists and nested lists
* Joining content spread across pages
* Removal of repeating headers and footers
* Watermark removal
* OCR with bounding boxes
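To make the "Paragraphs - combines lines" item concrete, here's a minimal sketch of that kind of line-merging rule (my own illustration, not the library's actual implementation), in plain Python:

```python
def lines_to_paragraphs(lines):
    """Merge consecutive non-empty lines into paragraphs.

    Blank lines act as paragraph breaks. Real parsers also use layout
    cues (indentation, vertical gaps, font changes); this only shows
    the basic idea.
    """
    paragraphs, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

pdf_lines = [
    "Parsing PDFs can be quite the",
    "headache because the format",
    "is so complex.",
    "",
    "We support most of these features.",
]
print(lines_to_paragraphs(pdf_lines))
# → ['Parsing PDFs can be quite the headache because the format is so complex.',
#    'We support most of these features.']
```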
Here are a couple of examples:
- https://neuml.hashnode.dev/build-rag-pipelines-with-txtai
- https://neuml.hashnode.dev/extract-text-from-documents
Disclaimer: I'm the primary author of txtai (https://github.com/neuml/txtai).
Currently I'm using a mix of MuPDF + AWS Textract (mostly for tables), but I'd love to understand what other people are doing.
Parsing PDFs can be quite the headache because the format is so complex. We support most of these features already but there are always so many edge cases that additional angles can be very helpful.
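Those edge cases are why rule-based heuristics pile up. As one illustration of the "removal of repeating headers and footers" problem (a sketch of the general technique, not any particular library's code): candidate headers and footers can be found by counting how often a page's first or last line recurs across pages:

```python
from collections import Counter

def strip_repeating_lines(pages, min_ratio=0.8):
    """Drop first/last lines that repeat on most pages (likely headers/footers).

    `pages` is a list of pages, each a non-empty list of text lines.
    The 0.8 ratio is an arbitrary threshold you'd tune; real parsers
    also normalize page numbers so "Page 1"/"Page 2" match each other.
    """
    edges = Counter()
    for lines in pages:
        for line in {lines[0], lines[-1]}:  # set: don't double-count 1-line pages
            edges[line] += 1
    threshold = min_ratio * len(pages)
    repeated = {line for line, n in edges.items() if n >= threshold}
    return [[l for l in lines if l not in repeated] for lines in pages]

pages = [
    ["ACME Annual Report", "Revenue grew 10%.", "Page 1"],
    ["ACME Annual Report", "Costs fell 5%.", "Page 2"],
    ["ACME Annual Report", "Outlook is positive.", "Page 3"],
]
print(strip_repeating_lines(pages))
```

Here the header "ACME Annual Report" appears on all three pages and is stripped, while the distinct "Page N" footers survive — exactly the kind of edge case (unnormalized page numbers) that keeps these parsers growing.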
There are now a lot of file loaders for RAG (LangChain, LlamaIndex, unstructured, ...). Is there any reason, like a leading benchmark score, to prefer this one?
However, I have a PDF parsing use case that I tried those RAG tools for, and the output they give me is pretty low quality. It kinda works for RAG, since the LLM can work around the issues, but if you want higher-quality responses with proper references and such, I think the best way is to write your own rule-based parser, which is what I ended up doing (based on MuPDF though, not Tika).
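For what it's worth, the core of such a rule-based parser can be quite small. A hedged sketch (my own illustration, not the parser described above): given text spans with font sizes, in roughly the shape PyMuPDF's `page.get_text("dict")` produces, headings can be split from body text by comparing each span's size against the dominant size. The 1.2 ratio is an assumption you'd tune per corpus:

```python
from collections import Counter

def classify_spans(spans, ratio=1.2):
    """Label spans as 'heading' or 'body' by font size.

    Assumes spans shaped like {"text": ..., "size": ...}. The most
    common (rounded) size is taken as the body size; anything more
    than `ratio` times larger is treated as a heading.
    """
    body_size = Counter(round(s["size"]) for s in spans).most_common(1)[0][0]
    return [
        ("heading" if s["size"] > body_size * ratio else "body", s["text"])
        for s in spans
    ]

spans = [
    {"text": "1. Introduction", "size": 16.0},
    {"text": "PDF parsing is hard.", "size": 10.0},
    {"text": "It has many edge cases.", "size": 10.0},
]
print(classify_spans(spans))
# → [('heading', '1. Introduction'), ('body', 'PDF parsing is hard.'),
#    ('body', 'It has many edge cases.')]
```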
Maybe that's what the authors of this tool were thinking too.
None of the mentioned tools did this out of the box, none seemed easy to configure, and all are definitely hyped and marketed way beyond fitz though.
Overall, I believe there has to be some middle ground for identification and trust building over time, between a "hidden group with no names on $CORP secure site" and more traditional means of introduction.
Thanks for posting this interesting and relevant work.
Some examples are here with a notebook: https://github.com/nlmatics/llmsherpa
Here's another notebook from the repo with examples: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks...
You can use the llmsherpa library (https://github.com/nlmatics/llmsherpa) with this server to get nice, layout-friendly chunks for your LLM/RAG project.
What this library, and something like fitz/PyMuPDF, allows you to do is extract the text straight from the PDF, using rules about how to parse and structure it. (With most modern PDFs you can extract the text without OCR.)
It's much cheaper, obviously, but doesn't scale well across dynamic layouts, so you'll likely use it when you can configure around a standard structure. I have found rule-based text extraction to work fairly dynamically, though, for things like scientific PDFs.
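For scientific PDFs specifically, a lot of that mileage comes from exploiting their conventions. A small hypothetical example of such a rule (my own sketch): numbered headings like "2.1 Methods" can be picked out with a regex, with nesting depth read straight off the numbering:

```python
import re

# Matches headings such as "2 Methods" or "2.1 Data collection".
SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")

def parse_outline(lines):
    """Extract (level, number, title) tuples for numbered headings.

    Caveat: a bare regex will false-match lines like "10 patients were
    enrolled", so real parsers combine this with font-size or layout cues.
    """
    outline = []
    for line in lines:
        m = SECTION_RE.match(line.strip())
        if m:
            number, title = m.groups()
            outline.append((number.count(".") + 1, number, title))
    return outline

lines = ["1 Introduction", "Some body text.", "2 Methods", "2.1 Data collection"]
print(parse_outline(lines))
# → [(1, '1', 'Introduction'), (1, '2', 'Methods'), (2, '2.1', 'Data collection')]
```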
All this is automated in the llmsherpa parser https://github.com/nlmatics/llmsherpa which you can use as an API over this library.
I fear Tesseract OCR is a potential limitation, though. I've seen it make so many mistakes.
Looks like Apache 2 license which is nice.