Because PDFs might not have the data in a structured form at all. How would you get structured data out of an image in the PDF?
Misspellings, default names, a mixture of both, home-brew naming schemes, meticulous schemes: I've seen it all. It's definitely easier to just rasterize it and OCR it.
PDFs don't always use UTF-8; sometimes they assign random-seeming numbers to individual glyphs (this is common when unused glyphs are stripped from an embedded font, for example)
etc etc
I suggest spending a few minutes using a PDF editor with some real-world PDFs, or even just copying and pasting text from a range of different PDFs. These files are made up of cute tricks and hacks that whatever produced them used to make something that works visually. The high-quality implementations just put the pixels where they're told to. The underlying "structured data" is a lie.
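To make the glyph-numbering problem concrete, here's a minimal sketch (the codes and mappings are made up, not from any real PDF): a subsetted embedded font refers to glyphs by arbitrary internal codes, and only an optional ToUnicode CMap ties those codes back to characters. When the CMap is missing or wrong, an extractor can only emit placeholders or garbage, even though the page renders perfectly.

```python
# Hypothetical glyph codes as they'd appear in a PDF content stream after
# font subsetting renumbered the glyphs (values chosen for illustration).
codes = [3, 5, 9, 9, 4]

# Case 1: the producer embedded a ToUnicode CMap, so extraction works.
with_cmap = {3: "H", 5: "e", 9: "l", 4: "o"}

# Case 2: no CMap; the codes mean nothing outside this one font.
def extract(codes, to_unicode):
    # Fall back to U+FFFD (replacement character) for unmapped codes.
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)

print(extract(codes, with_cmap))  # "Hello"
print(extract(codes, {}))         # five replacement characters
```

The rendered pixels are identical in both cases, which is exactly why producers can (and do) skip the CMap.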
EDIT: I see from further down the thread that your experience of PDFs comes from programmatically generated invoice templates, which may explain why you think this way.
We have algorithms that combine the individual letters into words, words into lines, and lines into boxes, all by looking at the geometry, and that (obviously) identify the spaces between words.
We handle hidden text and problematic glyph-to-unicode tables.
The output is similar to OCR's, except we skip the rasterization, and the quality is higher because we don't depend on vision-based text recognition.
I built the base implementation of all this in less than a month, ten years ago, and we rarely, if ever, touch it.
We also run machine learning on the structured output afterwards.
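The geometric grouping described above can be sketched in a few lines. This is my own illustrative version, not the commenter's actual code; the function name and the tolerance/gap thresholds are assumptions. Glyphs are clustered into lines by y-coordinate, sorted by x within each line, and a horizontal gap wider than a fraction of the previous glyph's width becomes a word break.

```python
def glyphs_to_text(glyphs, line_tol=2.0, space_ratio=0.4):
    """glyphs: list of (char, x, y, width), y growing downward.
    line_tol and space_ratio are hypothetical tuning knobs."""
    # Cluster glyphs into lines: nearby y-coordinates share a line.
    lines = []
    for g in sorted(glyphs, key=lambda g: g[2]):
        if lines and abs(g[2] - lines[-1][0][2]) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    out = []
    for line in lines:
        line.sort(key=lambda g: g[1])  # left-to-right within the line
        text = line[0][0]
        for prev, cur in zip(line, line[1:]):
            gap = cur[1] - (prev[1] + prev[3])
            # A gap wider than a fraction of the glyph width is a word break.
            if gap > space_ratio * prev[3]:
                text += " "
            text += cur[0]
        out.append(text)
    return "\n".join(out)
```

A real implementation also has to cope with rotated text, overlapping glyphs, and variable font metrics, but the core idea is this kind of purely geometric clustering.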
Because the underlying "structured data" is never checked while the visual output is checked by dozens of people.
"Truth" is the stuff that the meatbags call "truth" as seen by their squishy ocular balls--what the computer sees doesn't matter.
The other thing is segmenting a document and linearizing it so that an LLM can understand the content better. Layout understanding helps with figuring out the natural reading order of various blocks of the page.
> There are many cases images are exported as PDFs.
One client of a client would print out her documents, then "scan" them with an Android app (actually just a photograph wrapped in a PDF). She was taught that this app is the way to create PDF files, and staunchly refused to be retrained. She came up with this print-then-photograph workflow after being told not to photograph the computer monitor; that's the furthest retraining she was able to absorb. Make no mistake, this woman was extremely successful in her field. Successful enough to be a client of my client. But she was taught that PDF equals that specific app, and wasn't going to change her workflow to accommodate others.
You might think of your post as a <div>. Some kind of paragraph or box of text in which the text is laid out and styles applied. That's how HTML does it.
PDF doesn't necessarily work that way. Different lines, words, or letters can be in entirely different places in the document. Anything that resembles a separator, table, etc. can also be anywhere in the document and might be output as a bunch of separate lines disconnected from each other and from the text. A producer might output two-column text as it runs horizontally across the page, so when you "parse" it by machine the text from both columns gets interleaved. Or it might output the columns separately.
You can see a user-visible side effect of this when PDF text selection is implemented the straightforward way: in some documents you have no problem selecting text, while in others the selection jumps around or grabs abject nonsense unrelated to the cursor position. That's because the underlying objects are not laid out in a display "flow" the way HTML does by default, so selection picks the next object in document order rather than the next object by visual position.
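A tiny sketch of the interleaving problem (the spans, coordinates, and the x-threshold column split are all invented for illustration, not any real library's API): a producer may emit two-column text row by row, so naive document-order extraction interleaves the columns, while reading order requires sorting by geometry first.

```python
# (text, x, y) spans in the order a producer might paint them:
# alternating columns, one visual row at a time.
spans = [
    ("Left line 1", 50, 100), ("Right line 1", 300, 100),
    ("Left line 2", 50, 120), ("Right line 2", 300, 120),
]

# Naive extraction: document order -> the two columns interleave.
naive = [t for t, x, y in spans]

# Column-aware reading order: split on a hypothetical x-threshold,
# then read each column top to bottom.
left = [t for t, x, y in spans if x < 200]
right = [t for t, x, y in spans if x >= 200]
reading = left + right

print(naive)    # columns interleaved, row by row
print(reading)  # left column first, then right column
```

Real layout analysis has to infer the column boundary instead of hardcoding it, but this is the gap between "next object in the document" and "next object on the page".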
If you were leading Tensorlake, running on early stage VC with only 10 employees (https://pitchbook.com/profiles/company/594250-75), you'd focus all your resources on shipping products quickly, iterating over unseen customer needs that could make the business skyrocket, and making your customers so happy that they tell everyone and buy lots more licenses.
Because you're a stellar tech leader and strategist, you wouldn't waste a penny reinventing low-level plumbing that's available off-the-shelf, either cheaply or as free OSS. You'd be thinking about the inevitable opportunity costs: If I build X then I can't build Y, simply because a tiny startup doesn't have enough resources to build X and Y. You'd quickly conclude that building a homegrown, robust PDF parser would be an open-ended tar pit that precludes us from focusing on making our customers happy and growing the business.
And the rest of us would watch in awe, seeing truly great tech leadership at work, making it all look easy.
Only three of them can process all 2500 files I tried (which are just machine manuals from major manufacturers, so not highly weird shit) without hitting errors, let alone producing correct results.
About 10 of them have a 5% or less failure rate on parsing the files (let alone extracting text). This is horrible.
From there it goes downhill fast.
I'm retired, so I have time to fuck around like this. But going in, there is no way I would have expected these results, or had time to figure out which three libraries could actually be used.
> just using the "quality implementation"?
What is the quality implementation?