Show HN: Parsing horse racing charts with Apache PDFBox

(github.com)

115 points · robinhowlett · 8y ago · 31 comments

joosters 8y ago

Very interesting! I had never heard of Apache PDFBox before; I must give it a try. I have a similar program that parses horse racing PDFs from sites such as www.racehorserunner.com, which use a much simpler format but cause endless problems for me when the PDFs have layout problems. For example, one column can be too long and overlap with another, e.g. the last race on http://www.racehorserunner.com/Archives/ELP/ELP170702.pdf

All the PDF parsers that I have tried cope very badly with these kinds of situations, and often try to be 'too clever' in that they value the final layout of the text over and above the individual strings.

Have you experienced similar problems with PDFBox, or does it handle formatting and layout fairly reliably?

jahewson 8y ago

PDFBox committer here. If you want even lower-level access to the page content stream, without anything 'clever' at all, check out the PDFGraphicsStreamEngine class, which is a superclass of the text extraction and rendering classes. It gives you access to the raw glyphs. You can override PageRenderer too, for visual debugging, e.g. to render glyph bounding boxes. We have an interactive Swing PDFDebugger which does just that.

https://github.com/apache/pdfbox/blob/6f18d7c4bef4d23a22d/ex...

joosters 8y ago

Thanks for the guidance, I'll take a look.

robinhowlett (OP) 8y ago

Yes, I encountered similar issues, but I was able to solve many of them.

With PDFBox I was able to deal with the content at a very low level (on a per-character basis). For instance, when building a String, I would insert a pipe character whenever the distance between adjacent characters was greater than the width of the space character, and then detect that pipe when translating the text to a particular field.

See the convertToText() method for an example: https://github.com/robinhowlett/chart-parser/blob/master/src...

and https://github.com/robinhowlett/chart-parser/blob/f8d651e9a1... for when I used this technique
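
A minimal sketch of the gap-based delimiter idea described above. It is not the chart-parser code and does not use PDFBox types: the `Glyph` class, its `x`/`width` fields, and `GapDelimiter.join` are hypothetical names invented for illustration, and the sketch assumes the glyphs of one text line are already sorted left to right.

```java
import java.util.List;

final class GapDelimiter {

    // Hypothetical stand-in for a positioned character (not a PDFBox class).
    static final class Glyph {
        final String text;  // the character(s) this glyph draws
        final float x;      // left edge of the glyph on the page
        final float width;  // advance width of the glyph

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Concatenates the glyphs of one line, inserting '|' wherever the
    // horizontal gap to the previous glyph is wider than spaceWidth.
    static String join(List<Glyph> glyphs, float spaceWidth) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth) {
                    sb.append('|'); // column boundary marker
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }
}
```

With glyphs "A" at x=0, "B" at x=5 and "C" at x=30 (each 5 units wide) and a space width of 3, `GapDelimiter.join` returns `AB|C`, and a later step could split on the pipe to recover the fields.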

tcho 8y ago

Very cool, good to see the level of control this package allows.

bpicolo 8y ago

Huh, interesting. I was looking around for PDF libs previously and PDFBox didn't show up in the Google results; pdftk was the only one that showed up anywhere useful.

Edit: Looks like it's on the second page of results and I never made it that far, heh. Goes to show how biasing the first page of results is.

maxxxxx 8y ago

I still don't understand how PDF could become one of the standards for publishing documents. Well-structured content gets converted into PDF, which loses most of that structure. And then a lot of work is done to guess that structure from the PDF and convert it back to a better file format. It just shows that successful solutions don't have to be technically good.

userbinator 8y ago

The keyword is "publishing" --- as in, producing human-readable physical copies, not electronic ones. It just so happens that the format was relatively suitable for the latter too (because it actually looks like a printed document rendered on the screen --- unlike HTML or other formats around at the time), which is why that use-case became popular. PDF is basically a descendant of PostScript, which was designed to control printers.

(Its PostScript origins may also explain the bizarre mix of text and binary that constitutes the file format. For example, page contents are in a relatively free-form PostScript-ish RPN-like textual language, but are found in "content streams" which may be compressed or encoded into a binary format. Data "object" structures include things like '<<'-delimited dictionaries, '[' arrays ']', textual "/Names", and even provisions for comments(!?).

Then there are things like the cross-reference table of all objects in the file, which is an array of fixed-width textual numbers representing file offsets, e.g. "0000001056 00000 n" refers to something 1056 bytes from the start of the file. Reactions of WTF!? from those working with the format for the first time are not uncommon.)
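
As an illustration of the fixed-width cross-reference entry mentioned above, here is a small sketch that splits one entry such as "0000001056 00000 n" into its fields. The `XrefEntry` class and its field names are hypothetical, chosen for this example; the layout assumed is a 10-digit first field, a 5-digit generation number, and an `n` (in use) or `f` (free) marker, each separated by a single space.

```java
final class XrefEntry {
    final long offset;     // for 'n' entries: byte offset from the start of the file
                           // (for 'f' entries this field is the next free object number)
    final int generation;  // generation number of the object
    final boolean inUse;   // true for 'n' entries, false for 'f' entries

    private XrefEntry(long offset, int generation, boolean inUse) {
        this.offset = offset;
        this.generation = generation;
        this.inUse = inUse;
    }

    // Parses one entry; surrounding whitespace and line endings are ignored.
    static XrefEntry parse(String line) {
        String s = line.trim();
        if (s.length() != 18 || s.charAt(10) != ' ' || s.charAt(16) != ' ') {
            throw new IllegalArgumentException("not a cross-reference entry: " + line);
        }
        long first = Long.parseLong(s.substring(0, 10));        // 10 fixed-width digits
        int generation = Integer.parseInt(s.substring(11, 16)); // 5 fixed-width digits
        char type = s.charAt(17);
        if (type != 'n' && type != 'f') {
            throw new IllegalArgumentException("unknown entry type: " + type);
        }
        return new XrefEntry(first, generation, type == 'n');
    }
}
```

For the entry quoted above, `XrefEntry.parse("0000001056 00000 n")` gives an in-use entry with `offset` 1056 and `generation` 0.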

amenghra 8y ago

Minimal PDF explained: https://brendanzagaeski.appspot.com/0004.html

jahewson 8y ago

PDF has a feature called Tagged PDF, which allows the document to be annotated with a semantic structure. Almost nobody bothers to generate such PDFs, but the support is there!

oever 8y ago

Dutch law requires that official documents be published as PDF/A-1a, which is a subset of PDF 1.4 that can be archived and must be tagged.

cpach 8y ago

Sadly I think that often the publishers actually want it that way, i.e. they do not want the data to be easily parsable...

beager 8y ago

I think it's more that they want consistency in rendering across devices and media.

beager 8y ago

Very neat, and gets me curious about PDFBox, but every time I see something that converts a consistent-layout PDF back to structured data, I just bemoan the fact that this would all be trivial with an API for these kinds of things.

0x445442 8y ago

Great job!

I was just looking at collecting race information and historical results data a month or two ago and was struck by the lack of available structured data. Heck, I couldn't easily find any for-pay options either.

Cyph0n 8y ago

Firstly, what an interesting library. Secondly, this is among the best TLDR readmes I've ever seen! I lack exposure to this area, so I'm actually quite impressed with the complexity of it.

Keep up the great work.

richiverse 8y ago

As a Python programmer, I found R's pdftools to be indispensable for messy text-based PDFs. I couldn't find a Python lib that worked as consistently across a variety of formats.

tunaoftheland 8y ago

I came across https://github.com/pdfminer/pdfminer.six recently and was impressed with what it could get done. The documentation can be challenging to parse, so I relied on a code sample from a StackOverflow answer. Have you had a chance to try it out? Curious about how/if it works well across platforms.

hbcondo714 8y ago

Impressive! Seems like you can't just use PDFBox out of the box (no pun intended) and need to write some custom code specific to the PDF itself per the chart-parser commits[1]

[1] https://github.com/robinhowlett/chart-parser/tree/master/src...

robinhowlett (OP) 8y ago

Author here; well, PDFBox is good for simple text stripping. If I wanted to print all the text on the PDF, that would be very straightforward and not much code. However, the PDF chart here is in essence a representation of structured data. I wanted to get the content in that format so that I could both serialize to JSON plus have an SDK to boot.

JabavuAdams 8y ago

Crazy! I was just looking into this topic a few weeks ago, for a friend. Thanks!

vbuwivbiu 8y ago

What I would love is an app that would reformat portrait PDFs as 2-column landscape for reading on my screen.

mpweiher 8y ago

Most PDF viewers I am aware of (including my own, PostView) have 2-up modes. Are those not sufficient?

vbuwivbiu 8y ago

That's still presenting the pages in portrait orientation. I want the pages to be landscape and for the text to flow in at least 2 columns.

ocrimgproc 8y ago

Can it be used for invoices?
