Because PDFs might not have the data in a structured form at all. How would you get structured data out of an image in the PDF?
Misspellings, default names, a mixture of both, home-brew naming schemes, meticulous schemes: I've seen it all. It's definitely easier to just rasterize it and OCR it.
PDFs don't always use UTF-8; sometimes they assign random-seeming numbers to individual glyphs (this is common when unused glyphs are stripped from an embedded font, for example)
etc etc
I suggest spending a few minutes using a PDF editor with some real-world PDFs, or even just copying and pasting text from a range of different PDFs. These files are made up of cute tricks and hacks that whatever produced them used to make something that works visually. The high-quality implementations just put the pixels where they're told to. The underlying "structured data" is a lie.
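To make the glyph-numbering problem concrete, here's a minimal sketch (the codes and mappings are made up, not from any real PDF): a subsetted embedded font refers to glyphs by arbitrary internal codes, and only an optional ToUnicode CMap ties those codes back to characters. When the CMap is missing or wrong, an extractor can only emit placeholders or garbage, even though the page renders perfectly.

```python
# Hypothetical glyph codes as they'd appear in a PDF content stream after
# font subsetting renumbered the glyphs (values chosen for illustration).
codes = [3, 5, 9, 9, 4]

# Case 1: the producer embedded a ToUnicode CMap, so extraction works.
with_cmap = {3: "H", 5: "e", 9: "l", 4: "o"}

# Case 2: no CMap; the codes mean nothing outside this one font.
def extract(codes, to_unicode):
    # Fall back to U+FFFD (replacement character) for unmapped codes.
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)

print(extract(codes, with_cmap))  # "Hello"
print(extract(codes, {}))         # five replacement characters
```

The rendered pixels are identical in both cases, which is exactly why producers can (and do) skip the CMap.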
EDIT: I see from further down the thread that your experience of PDFs comes from programmatically generated invoice templates, which may explain why you think this way.
We have algorithms that combine the individual letters into words, words into lines, and lines into boxes, all by looking at the geometry, and that (obviously) identify the spaces between words.
We handle hidden text and problematic glyph-to-unicode tables.
The output is similar to OCR's, except we skip the rasterization, and the quality is higher because we don't depend on vision-based text recognition.
I built the base implementation of all this in less than a month, ten years ago, and we rarely, if ever, touch it.
We also run machine learning on the structured output afterwards.
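The geometric grouping described above can be sketched in a few lines. This is my own illustrative version, not the commenter's actual code; the function name and the tolerance/gap thresholds are assumptions. Glyphs are clustered into lines by y-coordinate, sorted by x within each line, and a horizontal gap wider than a fraction of the previous glyph's width becomes a word break.

```python
def glyphs_to_text(glyphs, line_tol=2.0, space_ratio=0.4):
    """glyphs: list of (char, x, y, width), y growing downward.
    line_tol and space_ratio are hypothetical tuning knobs."""
    # Cluster glyphs into lines: nearby y-coordinates share a line.
    lines = []
    for g in sorted(glyphs, key=lambda g: g[2]):
        if lines and abs(g[2] - lines[-1][0][2]) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    out = []
    for line in lines:
        line.sort(key=lambda g: g[1])  # left-to-right within the line
        text = line[0][0]
        for prev, cur in zip(line, line[1:]):
            gap = cur[1] - (prev[1] + prev[3])
            # A gap wider than a fraction of the glyph width is a word break.
            if gap > space_ratio * prev[3]:
                text += " "
            text += cur[0]
        out.append(text)
    return "\n".join(out)
```

A real implementation also has to cope with rotated text, overlapping glyphs, and variable font metrics, but the core idea is this kind of purely geometric clustering.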
Because the underlying "structured data" is never checked while the visual output is checked by dozens of people.
"Truth" is the stuff that the meatbags call "truth" as seen by their squishy ocular balls--what the computer sees doesn't matter.
The other thing is segmenting a document and linearizing it so that an LLM can understand the content better. Layout understanding helps with figuring out the natural reading order of various blocks of the page.
> There are many cases images are exported as PDFs.
One client of a client would print out her documents, then "scan" them with an Android app (actually just a photograph wrapped in a PDF). She was taught that this app is the way to create PDF files, and staunchly refused to be retrained. She came up with this print-then-photograph workflow after being told not to photograph the computer monitor; that's the furthest retraining she was able to absorb. Make no mistake, this woman was extremely successful in her field. Successful enough to be a client of my client. But she was taught that PDF equals that specific app, and wasn't going to change her workflow to accommodate others.
You might think of your post as a <div>. Some kind of paragraph or box of text in which the text is laid out and styles applied. That's how HTML does it.
PDF doesn't necessarily work that way. Different lines, words, or letters can be in entirely different places in the document. Anything that resembles a separator, table, etc. can also be anywhere in the document and might be output as a bunch of separate lines disconnected from each other and from the text. A producer might output two-column text as it runs horizontally across the page, so when you "parse" it by machine the text from both columns gets interleaved. Or it might output the columns separately.
You can see a user-visible side effect of this when PDF text selection is implemented the straightforward way: in some documents you have no problem selecting text, while in others the selection jumps around or grabs abject nonsense unrelated to the cursor position. That's because the underlying objects are not laid out in a display "flow" the way HTML does by default, so selection picks the next object in document order rather than the next object by visual position.
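A tiny sketch of the interleaving problem (the spans, coordinates, and the x-threshold column split are all invented for illustration, not any real library's API): a producer may emit two-column text row by row, so naive document-order extraction interleaves the columns, while reading order requires sorting by geometry first.

```python
# (text, x, y) spans in the order a producer might paint them:
# alternating columns, one visual row at a time.
spans = [
    ("Left line 1", 50, 100), ("Right line 1", 300, 100),
    ("Left line 2", 50, 120), ("Right line 2", 300, 120),
]

# Naive extraction: document order -> the two columns interleave.
naive = [t for t, x, y in spans]

# Column-aware reading order: split on a hypothetical x-threshold,
# then read each column top to bottom.
left = [t for t, x, y in spans if x < 200]
right = [t for t, x, y in spans if x >= 200]
reading = left + right

print(naive)    # columns interleaved, row by row
print(reading)  # left column first, then right column
```

Real layout analysis has to infer the column boundary instead of hardcoding it, but this is the gap between "next object in the document" and "next object on the page".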
If you were leading Tensorlake, running on early stage VC with only 10 employees (https://pitchbook.com/profiles/company/594250-75), you'd focus all your resources on shipping products quickly, iterating over unseen customer needs that could make the business skyrocket, and making your customers so happy that they tell everyone and buy lots more licenses.
Because you're a stellar tech leader and strategist, you wouldn't waste a penny reinventing low-level plumbing that's available off-the-shelf, either cheaply or as free OSS. You'd be thinking about the inevitable opportunity costs: If I build X then I can't build Y, simply because a tiny startup doesn't have enough resources to build X and Y. You'd quickly conclude that building a homegrown, robust PDF parser would be an open-ended tar pit that precludes us from focusing on making our customers happy and growing the business.
And the rest of us would watch in awe, seeing truly great tech leadership at work, making it all look easy.
Only three of them can process all 2500 files I tried (which are just machine manuals from major manufacturers, so not highly weird shit) without hitting errors, let alone producing correct results.
About 10 of them have a 5% or less failure rate on parsing the files (let alone extracting text). This is horrible.
From there it goes downhill fast.
I'm retired, so I have time to fuck around like this. But going in, there is no way I would have expected these results, or had time to figure out which three libraries could actually be used.
> just using the "quality implementation"?
What is the quality implementation?