Because you're right if you're paid to evaluate all the formats with the Mark 1 eyeball and do a custom parser for each. It sounds like it's feasible for your application.
If you want a generic solution that doesn't rely on a human spending a week figuring out that those 4 absolutely positioned text fields are the invoice number together (and in order 1 4 2 3), maybe you're wrong.
Source: I don't parse pdfs for a living, but sometimes I have to select text out of pdf schematics. A lot of times I just give up and type what my Mark 1 eyeball sees in a text editor.
So it is absurd to pretend you can solve the rendering problem by rendering it into an image instead of a structured format. By rendering it into a raster, now you have 3 problems, parsing the PDF, rendering quality raster, then OCR'ing the raster. It is mind numbingly absurd.
If your PDF renders a part of the sentence at the beginning of the document, a part in the middle, and a part at the end, split between multiple sections, it's still rather trivial to render.
To parse and understand that this is the same sentence? A completely different matter.
We also parse millions of PDFs per month in all kinds languages (both Western and Asian alphabets).
Getting the basics of PDF parsing to work is really not that complicated -- A few months work. And is an order of magnitude more efficient than generating an image in 300-600 DPI and doing OCR or Visual LLM.
But some of the challenges (which we have solved) are:
• Glyphs to unicode tables are often limited or incorrect • "Boxing" blocks of text into "paragraphs" can be tricky • Handling extra spaces and missing spaces between letters and words. Often PDFs do not include the spaces or they are incorrect so you need to identify gaps yourself. • Often graphic designers of magazines/newspapers will hide text behind e.g. a simple white rectangle, and place new version of the text above. So you need to keep track of z-order and ignore hidden text. • Common text can be embedded as vector paths -- Not just logos but we also see it with text. So you need a way to handle that. • Dropcap and similar "artistic" choices can be a bit painful
There are lot of other smaller issues -- but they are generally edge cases.
OCR handles some of these issues for you. But we found that OCR often misidentifies letters (all major OCR), and they are certainly not perfect with spaces either. So if you are going for quality, you can get better results if you parse the PDFs.
Visual Transformers are not good with accurate coordinates/boxing yet -- At least we haven't seen a good enough implementation of it yet. Even though it is getting better.
We process several million pages from Newspapers and Magazines from all over the world with medium to very high complexity layouts.
We built the PDF parser on top of open source PDF libraries, and this gives many advantages: • We can accurately get headlines other text placed on top on images. OCR is generally hopeless with text placed on top of images or on complex backgrounds • Distinguish letters accurately (i.e. number 1, I, l, "o", "zero") • OCR will pick up ghost letters from images, where OCR program believes there is text, even if there isn't. We don't. • We have much higher accuracy than OCR because we don't depend on the OCR programs' ability to recognize the letters. • We can utilize font information and accurate color information, which helps us distinguish elements from each other. • We have accurate bounding box locations of each letter, word, line, and block (pts).
To do it, we completely abandon the PDF text-structure and only use the individual location of each letter. Then we combine letter positions to words, words to lines, and lines to text-blocks using a number of algorithms.
We use the structure blocks that we generated with machine learning afterwards, so this is just the first step in analyzing the page.
It may seem like a large undertaking, but it literally only took a few months to built this initially, and we have very rarely touched the code over the last 10 years. So it was a very good investment for us.
Obviously, you can achieve a lot of the same with OCR -- But you lose information, accuracy, and computational efficiency. And you depend on the OCR program you use. Best OCR programs are commercial and somewhat pricy at scale.