
throwaway4496 7mo ago

But all those problems exist when rendering to a surface or rasterizing, too. I just don't understand how one thinks: this is a hard problem, so let me make it harder by turning it into another kind of problem that is just as hard as solving it in the first place (PDF to structured data vs. PDF to raster), and then solve the new problem, which is also hard. It is absurd.


DannyBee 7mo ago

The problems don't actually exist in the way you think.

When extracting text directly, the goal is to put it back into content order, regardless of stream order. Then turn that into a string. As fast as possible.

That's straight text. If you want layout info, it does more. But it's also not just processing it as a straight stream and rasterizing the result. It's trying to avoid doing that work.

This is non-trivial on lots of PDFs, and a source of lots of parsing issues and errors, because the extractor is not just processing it all and rasterizing it, but trying to avoid doing that.
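To make the reordering step concrete, here is a minimal sketch of what a direct extractor has to do once it has text fragments with positions. The `Fragment` type, the `reading_order` helper, and the `line_tolerance` value are all made up for illustration and are not taken from any real PDF library; real extractors also have to deal with columns, rotation, and overlapping text, which this ignores.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge, in page units (hypothetical convention)
    y: float  # baseline, measured from the top of the page


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then sort each line left to right."""
    lines = []  # each entry: [baseline of the line's first fragment, [fragments]]
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(frag.y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return "\n".join(
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    )


# Fragments listed in "stream order", which differs from the visual order.
stream = [
    Fragment("world", 60, 10.5),
    Fragment("Second", 10, 30),
    Fragment("Hello", 10, 10),
    Fragment("line", 60, 30.4),
]
ordered = reading_order(stream)
```

Joining `stream` as-is gives "world Second Hello line"; `ordered` puts the two lines back together. The tolerance is the fragile part: pick it too small and one line splits in two, too large and adjacent lines merge.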

When rasterizing, you don't care about any of this at all. PDFs were made to raster easily. It does not matter what order the text is in the file, or where the tables are, because if you parse it straight through, raster, and splat it to the screen, it will be in the proper display order and look right.

So if you splat it onto the screen, and then extract it, it will be in the proper content/display order for you. Same is true of the tables, etc.
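A toy version of that argument, assuming a character grid stands in for the screen: `paint` and `read_back` are hypothetical helpers, and reading the grid back is trivial here only because the grid stores characters rather than pixels, so this says nothing about how hard real OCR is.

```python
def paint(ops, width, height):
    """Paint (text, column, row) draw operations onto a character grid."""
    grid = [[" "] * width for _ in range(height)]
    for text, col, row in ops:
        for offset, ch in enumerate(text):
            grid[row][col + offset] = ch
    return grid


def read_back(grid):
    """Read the grid top to bottom, left to right, dropping empty rows."""
    rows = ("".join(row).rstrip() for row in grid)
    return "\n".join(r for r in rows if r)


# Draw operations in an arbitrary "stream order" for a 2x2 table.
ops = [("B2", 6, 1), ("Name", 0, 0), ("A1", 0, 1), ("Value", 6, 0)]

stream_text = " ".join(text for text, _, _ in ops)
screen_text = read_back(paint(ops, width=12, height=2))
```

`stream_text` comes out scrambled ("B2 Name A1 Value"), while `screen_text` has the header row above the data row no matter how `ops` is shuffled, because position alone decides where each piece lands.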

So the direct extraction problems don't exist if you can parse the screen into whatever you want, with 100% accuracy (and of course it doesn't matter if you use AI or not to do it).

Now, I am not sure I would use this method anyway, but your claim that the same problems exist is definitely wrong.

quinnjh 7mo ago

I don’t think people are suggesting: build a renderer > build an OCR pipeline > run it on PDFs.

I think people are suggesting: use a readymade renderer > use readymade OCR pipelines/APIs > run it on PDFs.

A colleague uses a document scanner to create a PDF of a document and sends it to you.

You must return the data represented in it, retaining as much structure as possible.

How would you proceed? Return just the metadata of when and how the scan was made?

Genuinely wondering

throwaway4496 (OP) 7mo ago

You can use an existing readymade renderer to render into structured data instead of raster.

kwon-young 7mo ago

Just to illustrate this point, poppler [1] (which is the most popular PDF renderer in open source) has a little tool called pdftocairo [2] which can render a PDF into an SVG. This means you can delegate all PDF rendering to poppler and only work with actual graphical objects to extract semantics.
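As a rough sketch of that workflow, assuming poppler's `pdftocairo` command-line tool is installed: the two helper functions and the file names below are invented for illustration, and the sample SVG is hand-written, not real pdftocairo output (which, as far as I know, tends to emit text as glyph outlines rather than `<text>` elements).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def pdftocairo_svg_command(pdf_path, svg_path, page=1):
    """Build the argument list for rendering a single page to SVG with pdftocairo."""
    return ["pdftocairo", "-svg", "-f", str(page), "-l", str(page), pdf_path, svg_path]


def count_svg_elements(svg_text):
    """Count elements by tag name, with the XML namespace prefix stripped."""
    root = ET.fromstring(svg_text)
    return Counter(el.tag.rsplit("}", 1)[-1] for el in root.iter())


# To actually render (needs poppler-utils on PATH; file names are placeholders):
#   import subprocess
#   subprocess.run(pdftocairo_svg_command("in.pdf", "page1.svg"), check=True)

# Hand-written stand-in for a rendered page, just to exercise the counter.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path d="M0 0 L10 0"/><path d="M0 5 L10 5"/><g><use/></g>'
    "</svg>"
)
counts = count_svg_elements(sample)
command = pdftocairo_svg_command("in.pdf", "page1.svg", page=3)
```

From there the work is all on plain graphical objects (paths, groups, glyph references), which is where the semantic guessing starts.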

I think the reason this method is not popular is that there are still many ways to encode a semantic object graphically. A sentence can be broken down into words or letters. Table lines can be formed from multiple smaller lines, etc. But, as mentioned by the parent, rule-based systems work reasonably well for reasonably focused problems. But you will never have a general-purpose extractor, since rules need to be written by humans.
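One example of such a hand-written rule, for the "table lines formed from multiple smaller lines" case: merge horizontal segments that sit on (nearly) the same row and touch or almost touch. The function name, the `(x0, x1, y)` tuple layout with `x0 <= x1`, and both tolerances are assumptions for this sketch.

```python
def merge_horizontal(segments, y_tol=0.5, gap_tol=1.0):
    """Merge (x0, x1, y) segments that share a row and touch or nearly touch."""
    # Step 1: cluster segments into rows by y coordinate.
    rows = []  # each entry: [y of the row's first segment, [(x0, x1), ...]]
    for x0, x1, y in sorted(segments, key=lambda s: s[2]):
        if rows and abs(y - rows[-1][0]) <= y_tol:
            rows[-1][1].append((x0, x1))
        else:
            rows.append([y, [(x0, x1)]])

    # Step 2: within each row, merge spans whose gap is at most gap_tol.
    merged = []
    for row_y, spans in rows:
        spans.sort()
        cur0, cur1 = spans[0]
        for x0, x1 in spans[1:]:
            if x0 <= cur1 + gap_tol:
                cur1 = max(cur1, x1)
            else:
                merged.append((cur0, cur1, row_y))
                cur0, cur1 = x0, x1
        merged.append((cur0, cur1, row_y))
    return merged


# Three pieces of one ruling line, plus two separate lines lower on the page.
pieces = [
    (50, 100, 20.0),
    (0, 50, 20.0),
    (100, 150, 20.2),
    (0, 40, 60.0),
    (80, 150, 60.0),
]
lines = merge_horizontal(pieces)
```

The top row collapses into a single line from 0 to 150, while the two lower segments stay apart because the gap between them exceeds `gap_tol`. Every threshold like this is a rule someone had to pick for a particular family of documents, which is the limitation described above.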

[1] https://poppler.freedesktop.org/ [2] https://gitlab.freedesktop.org/poppler/poppler/-/blob/master...


wybiral 7mo ago

Sometimes scanned documents are structured really weirdly, especially for tables. Visually, we can recognize the intention when it's rendered, and so can the AI, but you practically have to render it to recover the spatial context.

throwaway4496 (OP) 7mo ago

But why do you have to render it into a bitmap?

rcxdude 7mo ago

PDF to raster seems a lot easier than PDF to structured data, at least in terms of dealing with the odd edge cases. PDF is designed to raster consistently, and if someone generates something that doesn't raster in enough viewers, they'll fix it. PDF does not have anything that constrains generators to a sensible structured representation of the information in the document, and most people generating PDF documents are going to look at the output, not run it through a system to extract the structured data.
