All PDF parsers that I have tried cope very badly with this kind of situation, and often try to be 'too clever' in that they value the final layout of the text over and above the individual strings.
Have you experienced similar problems with PDFBox, or does it handle formatting and layout fairly reliably?
https://github.com/apache/pdfbox/blob/6f18d7c4bef4d23a22d/ex...
With PDFBox I was able to deal with the content at a very low level (on a per-character basis). For instance, when building a String, I would insert a pipe character wherever the distance between adjacent characters was greater than the width of the space character, and then detect that pipe when mapping the text to a particular field.
See the convertToText() method for an example: https://github.com/robinhowlett/chart-parser/blob/master/src...
and https://github.com/robinhowlett/chart-parser/blob/f8d651e9a1... for when I used this technique
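To give a rough idea of the gap-marking technique without pulling in PDFBox, here is a minimal self-contained sketch. The Glyph class, the joinWithPipes name and the sample coordinates are all made up for illustration; they are not taken from chart-parser or from the PDFBox API, where the positions would come from the library's per-character text callbacks instead.

```java
import java.util.Arrays;
import java.util.List;

class GapMarker {
    /** Hypothetical stand-in for one positioned character; not a PDFBox type. */
    static final class Glyph {
        final String text;
        final float x;      // left edge of the character on the line
        final float width;  // rendered width of the character

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Concatenates the glyphs of one line (assumed sorted left to right),
     * inserting '|' wherever the horizontal gap between two neighbours
     * is wider than a space character.
     */
    static String joinWithPipes(List<Glyph> glyphs, float spaceWidth) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth) {
                    sb.append('|'); // marks a likely column boundary
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: "A" and "B" touch, "C" sits 20 units further right.
        List<Glyph> line = Arrays.asList(
                new Glyph("A", 0f, 5f),
                new Glyph("B", 5f, 5f),
                new Glyph("C", 30f, 5f));
        System.out.println(joinWithPipes(line, 3f)); // prints AB|C
    }
}
```

Splitting the result on the pipe then gives one string per column, which is easier to map to fields than trying to reconstruct the visual layout.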
Edit: Looks like it's on the second page of results and I never made it that far, heh. Goes to show how much the first page of results biases you.
(Its PostScript origins may also explain the bizarre mix of text and binary that constitutes the file format. For example, page contents are in a relatively free-form PostScript-ish RPN-like textual language, but are found in "content streams" which may be compressed or encoded into a binary format. Data "object" structures include things like '<<'-delimited dictionaries, '[' arrays ']', textual "/Names", and even provisions for comments(!?).
Then there are things like the cross-reference table of all objects in the file, which is a list of fixed-width textual entries giving file offsets, e.g. "0000001056 00000 n" refers to an object 1056 bytes from the start of the file. Reactions of WTF!? from those working with the format for the first time are not uncommon.)
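To make the fixed-width layout concrete, here is a small sketch that decodes a single classic cross-reference entry like the one quoted above: a 10-digit number, a 5-digit generation number, and an 'n' (in use) or 'f' (free) flag. The XrefEntry class and its field names are my own for illustration, and it ignores cross-reference streams, which newer files may use instead.

```java
final class XrefEntry {
    final long offset;     // byte offset from start of file (for in-use entries)
    final int generation;  // generation number of the object
    final boolean inUse;   // true for 'n', false for 'f'

    XrefEntry(long offset, int generation, boolean inUse) {
        this.offset = offset;
        this.generation = generation;
        this.inUse = inUse;
    }

    /** Parses one entry such as "0000001056 00000 n" (end-of-line bytes are trimmed). */
    static XrefEntry parse(String line) {
        String s = line.trim();
        // 10 digits + space + 5 digits + space + flag = 18 characters
        if (s.length() != 18 || s.charAt(10) != ' ' || s.charAt(16) != ' ') {
            throw new IllegalArgumentException("Malformed xref entry: " + line);
        }
        long first = Long.parseLong(s.substring(0, 10));
        int generation = Integer.parseInt(s.substring(11, 16));
        char flag = s.charAt(17);
        if (flag != 'n' && flag != 'f') {
            throw new IllegalArgumentException("Unknown xref flag: " + flag);
        }
        // Note: for free ('f') entries the first number is the next free
        // object number rather than a byte offset; it is stored as-is here.
        return new XrefEntry(first, generation, flag == 'n');
    }

    public static void main(String[] args) {
        XrefEntry e = XrefEntry.parse("0000001056 00000 n");
        System.out.println(e.offset + " " + e.generation + " " + e.inUse); // prints 1056 0 true
    }
}
```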
I was just looking at collecting race information and historical results data a month or two ago and was struck by the lack of available structured data. Heck, I couldn't easily find any paid options either.
Keep up the great work.
[1] https://github.com/robinhowlett/chart-parser/tree/master/src...