daemonologist 7mo ago

Yes, but a lot of the improvement is coming from layout models and/or multimodal LLMs operating directly on the raster images, as opposed to via classical OCR. This gets better results because the PDF format does not necessarily impart reading order or semantic meaning; the only way to be confident you're reading it like a human would is to actually do so - to render it out.
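To make the reading-order point concrete, here is a minimal sketch. Everything in it is hypothetical: the `Fragment` type, the coordinates, and the `gutter_x` column boundary are invented for illustration and do not come from any real PDF library. It only shows that positioned text runs can be stored in an order unrelated to how a human reads the page, and that a naive geometric sort does not fix a two-column layout.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A positioned text run, roughly what a content stream gives you (hypothetical)."""
    x: float  # left edge in points
    y: float  # distance from the top of the page in points
    text: str


# A made-up two-column page, listed in the order the file happens to store it.
STREAM_ORDER = [
    Fragment(300, 100, "column two, line one"),
    Fragment(50, 120, "column one, line two"),
    Fragment(50, 100, "column one, line one"),
    Fragment(300, 120, "column two, line two"),
]


def as_stored(frags):
    """Concatenate in file order: whatever the producer emitted."""
    return [f.text for f in frags]


def top_to_bottom(frags):
    """Naive geometric sort (row, then left to right): interleaves the two columns."""
    return [f.text for f in sorted(frags, key=lambda f: (f.y, f.x))]


def column_aware(frags, gutter_x=200.0):
    """Sort by column first, then row. Assumes the column boundary is already
    known, which is exactly the layout knowledge the file does not carry."""
    return [f.text for f in sorted(frags, key=lambda f: (f.x >= gutter_x, f.y, f.x))]


if __name__ == "__main__":
    for name, fn in [("stored", as_stored), ("naive", top_to_bottom), ("columns", column_aware)]:
        print(name, fn(STREAM_ORDER))
```

Only `column_aware` gives the human reading order here, and it needed `gutter_x` handed to it. Inferring that kind of structure from the rendered page is the job the layout models are doing.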

Another thing is that most document parsing tasks are going to run into a significant volume of PDFs which are actually just a bunch of scans/images of paper, so you need to build this capability anyways.

TL;DR: PDFs are basically steganography

throwaway4496 7mo ago

Hard no.

LLMs aren't going to magically do more than what your PDF rendering engine does; rasterizing it and OCR'ing it doesn't change anything. I am amazed at how many people actually think this is a sane idea.

protomikron 7mo ago

I think there is some kind of misunderstanding. Sure, if you get somehow structured, machine-generated PDFs, parsing them might be feasible.

But what about the "scanned" document part? How do you handle that? Your PDF rendering engine probably just says: image at pos x,y with size height,width.

So, as the parent says, you have to OCR/AI that photo anyway, and it seems that's also a feasible approach for "real" PDFs.
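A rough sketch of the routing decision described above, under loud assumptions: `Page` and `Image` are hypothetical stand-ins for whatever your PDF library reports, and the 0.8 coverage threshold is an arbitrary guess, not a tested value.

```python
from dataclasses import dataclass, field


@dataclass
class Image:
    """An embedded image as the renderer reports it: just a placed rectangle (hypothetical)."""
    width: float
    height: float


@dataclass
class Page:
    """Hypothetical summary of one page: its size, extracted text runs, placed images."""
    width: float
    height: float
    text_runs: list = field(default_factory=list)
    images: list = field(default_factory=list)


def looks_scanned(page, min_coverage=0.8):
    """Guess whether a page is a scan: no extractable text, and one image
    covering most of the page. The threshold is an assumption, not a standard."""
    if any(run.strip() for run in page.text_runs):
        return False
    page_area = page.width * page.height
    return any(img.width * img.height >= min_coverage * page_area for img in page.images)
```

Pages where `looks_scanned` returns true have nothing to parse, so they go to OCR or a vision model regardless of how the remaining pages are handled.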

throwaway4496 7mo ago

Okay, this sounds like "because some part of the road is rough, why don't we just drive in the ditch alongside the road the whole way? We could drive a tank, that would solve it."
