The project was, in the end, a complete failure, and several people were upset with me for not delivering what I was supposed to.
These days, with the capabilities LLMs now offer for extracting data from PDFs, I would 100% go the route of using AI to extract the data they wanted. Back then, that didn't exist yet.
OCR can take you pretty far depending on expectations, but it's never quite far enough in my experience.
The .NORM files (https://xkcd.com/2116)
We (at https://runtrellis.com/) have been building a PDF processing pipeline from the ground up with LLMs and VLMs and have seen close to 100% accuracy even on tricky PDFs. The key is to use a rule-based engine and references to cross-check the data.
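To sketch the cross-checking idea (an invented example, not Trellis's actual engine; the field names and rules here are hypothetical):

```python
# Invented example of rule-based cross-checking for LLM-extracted data.
# The field names ("total", "line_items", ...) are hypothetical.

def check_extraction(fields: dict) -> list[str]:
    """Return a list of rule violations for one extracted document."""
    errors = []
    # Rule 1: line items must sum to the stated total (within rounding).
    line_total = sum(item["amount"] for item in fields["line_items"])
    if abs(line_total - fields["total"]) > 0.01:
        errors.append(f"line items sum to {line_total}, stated total is {fields['total']}")
    # Rule 2: a value that appears in two places in the PDF must agree
    # with itself, which catches many mis-read or hallucinated numbers.
    if fields["total"] != fields["total_from_summary_page"]:
        errors.append("total differs between detail page and summary page")
    return errors

extracted = {
    "total": 119.0,
    "total_from_summary_page": 119.0,
    "line_items": [{"amount": 100.0}, {"amount": 19.0}],
}
print(check_extraction(extracted) or "all checks passed")
```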
Likewise, the task is easily achievable now with today's AI models.
It looks perfectly nice for its role, but I didn't use it for my last project because I needed serialization as well.
https://github.com/HexFiend/HexFiend/blob/master/templates/T...
I haven't had a lot of luck thus far, except with ones that allow escaping out of declarative mode into "and then run this code".
It would be better if all of the PDF's bytes were shown. It seems like `endobj` and `xref` are not shown.
EDIT: Oh it's actually reasonably simple, great use of CSS! https://github.com/desgeeko/pdfsyntax/blob/main/docs/simple_...
What I want is a simple document format that allows embedding other files and metadata without Adobe's bloat. I should be able to hyperlink within pages, change font size, etc., without text overflowing, and still be able to print in a consistent manner.
It's a page description format, not a data format, so all its decisions follow from the need to ensure that you and I can both print the same 'page' even if we use different operating systems, software, printers, exact paper dimensions, etc. I suspect the main reason it holds on so well is that so many things operate in a document paradigm, where 'document' means 'collection of sheets of paper.' Everything from the After-Visit Summary from the doctor, to your car registration document already has a specific visual representation chosen to allow them to fit sensibly and precisely on sheets of paper.
Could HTML (say, with data URLs for its images and CSS so that it can stand on its own), or ePub be a better format in most ways? Sort of, but it is optimized for such a different goal that if you went in to evangelize that switch to everyone who makes PDFs today, you'd be met with frustration that the content will look a bit different on every device, and that depending on settings, even the page breaks would fall differently.
Relatedly, it's interesting to me that even Google Docs, which I suspect are printed or converted to PDF far less than half the time, defaults to the "paged" mode (see Page Setup) that shows document page borders and margins, instead of the far more useful "Pageless" mode which is more like a normal webpage that fits to window and scrolls one continuous surface endlessly.
"without text overflowing" brings with it a lot of detail. In pdf every letter/character/glyph of text can have an exact x,y position on the page (or off the page sometimes). This allows for precise positioning of content regardless of what else is going on. It is up to the application that writes the pdf to position things correctly and implement letter or word wrapping.
XPS came the closest to reimplementing PDF, but Microsoft didn't get enough buy-in from other parties, so it quietly died.
The advantage is that PDFs don't need a full program interpreter to be rendered.
The current context for me is that I'm exploring various non-steganographic approaches to embedding metadata in photos. In the past, I've built custom formats to embed streaming data side by side: https://github.com/dustinfreeman/kriffer
It gives you a JSON representation of the PDF data structure. What's nice is that it doesn't hide the underlying format, but it takes care of a lot of the low-level edge cases for you.
Thank You For Making And Sharing!
[0] https://github.com/qpdf/qpdf, https://qpdf.readthedocs.io/en/stable/
https://qpdf.readthedocs.io/en/stable/json.html
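A rough sketch of poking at qpdf's JSON output from Python (assuming the qpdf CLI is installed; the filename is a placeholder):

```python
# Dump a PDF's structure as JSON via the qpdf CLI and load it in Python.
# "example.pdf" is a placeholder; the exact JSON layout depends on the
# qpdf JSON version, so see the docs linked above before relying on keys.
import json
import subprocess

result = subprocess.run(
    ["qpdf", "--json", "example.pdf"],
    capture_output=True, text=True, check=True,
)
doc = json.loads(result.stdout)
print(list(doc))  # top-level keys of the JSON representation
```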
https://github.com/maximoguerrero/PDF-GPT4-JSON
PDF is such a curious format. It's not human-readable, it's not well-structured, and it's not small. If it weren't for momentum and the political horse-trading that Apple, Adobe, and Microsoft were doing when the web went mainstream and freaked them out around 1995, I'm not sure that we'd be using it today. PostScript is better in countless ways, but since it's Turing-complete, it's not really ideal for storing static data, and to my knowledge it was never extended to handle binary data well, like embedded JPEGs. I remember trying to print a 10 MB .ps file in the 1990s, and it took maybe 20 minutes because the grayscale image was basically represented as a bunch of run-length-encoded scan lines.
I would argue that frontend web development has reached a similar fate. It seems odd to use a programming language (an imperative one, no less) to design media that we used to describe declaratively. If I had enjoyed success in my programming career, I would work on a declarative representation of HTML/CSS/JavaScript that can represent the intersection of all existing markup across all mainstream browsers. Sort of like a mix between Markdown and CSS flexbox, in the spirit of Xcode's Auto Layout, but universal. It frankly would probably look like HTML, but with sane defaults/builtins/inheritance, as well as a way to define and extend components from the beginning, similar to how people try to use data attributes. For contrast, React and Vue come at this from the opposite direction; I'm talking about something more like htmx.
Then we could work with that format and transpile to HTML or even React Native and dump 90-99% of the boilerplate and build tooling that we use currently.