But towards the end, you start to see the real objection to PDFs: that it's not always easy to extract text automatically from a document. It mentions a few of the issues - extra spaces, missing spaces, hidden off-page text that gets extracted anyway, fonts designed to obfuscate the internal text (e.g. by re-arranging characters or splitting glyphs up in strange ways), etc. It's worth noting that, with the exception of the spaces, these techniques are used deliberately to stop people extracting the text, or the copyrighted fonts, from the document.
It's not at all obvious from the document itself, but if you click on the link to the company, all becomes clear. The reason this company is saying that all these things are problems with PDF is that they are in the business of extracting raw text from PDF. All the designers' efforts to place things in specific places, to make the result pleasing for a human to read, etc.? They don't want any of that. They just want to extract the raw text so they can data-mine it and sell that as a service.
The PDF “specification” is not a specification; it only documents the happy path. It never states that the behavior of Acrobat remains the holy truth, but in practice undocumented bug-for-bug compatibility is assumed. (We're talking about the most basic, universally supported features here.) If ISO were worth their salt, they would at least try to codify the de facto behavior instead of stamping their name on an Adobe-provided document; then it would be a horrible but fixed format. A collection of conformance tests would be nice to have, too.
Of course, this “history” is just a promotional leaflet describing the “layman approach” they tried to construct. It's a fault not to mention that PDF was, and still is, a foundation of the digital print industry, where big vendors solve compatibility problems for mere mortals and thereby create the unwritten rules of what should and shouldn't work.
It is also ironic that they praise the Web, but have to use Web Archive to link to the article from the ancient year of… 2020.
The reason many PDFs produce garbage when extracting text is that the underlying document doesn't include fonts; every letter is a drawn glyph. This is most common in older (1990s) PDFs generated on UNIX systems.
Since the page description language just says "write text foo", and that text is broken up by the generating software, there is not necessarily a whole line of text as a human would see it.
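To illustrate, here's roughly what a PDF content stream looks like (the font name, coordinates, and split points are made up; real generators fragment text wherever kerning or layout demands it):

```
BT                          % begin text object
/F1 11 Tf                   % select font F1 at 11pt
72 700 Td                   % position the text
(The reason that man) Tj    % show a fragment...
96.4 0 Td                   % ...move over...
(y PDFs produce garb) Tj    % ...show the next fragment
ET                          % end text object
```

An extractor sees only these positioned fragments; it has to guess from the coordinates that "man" and "y" belong to the same word.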
And some PDFs are impossible to extract text from because the page has been flattened into an image. Law firms are notorious for doing this: it lets them provide documents exactly as required/specified during discovery while making the text impossible to extract. Basically it's a fax - every page is a TIFF image (because it is harder to OCR than a JPEG, although JBIG2 has its own flaws [0]).
I've been working with PDF projects off and on since the late 90s. The standard tries to be everything for everybody and that makes it a Charlie Foxtrot from top to bottom (if you've ever written an object viewer/editor to dig inside PDFs, you know exactly what I'm referring to). It is a great spec for making a document that appears the same no matter where it is viewed or printed. But I always treat it as a sausage: you can turn the cow into a sausage, but you can't turn that sausage back into a cow.
0 - https://www.theverge.com/2013/8/6/4594482/xerox-copiers-rand...
On GNU/Linux/BSD you have OCRmyPDF to do that.
Maybe (and, for the fonts, likely), but I don't think it's the only reason. Subsetting embedded fonts makes PDFs smaller, often a lot smaller (why embed an entire font when the document uses only a single glyph of it as a bullet point? Why include Chinese, Japanese, etc. glyphs if the document doesn't use them?)
Even if it's possible to do that without changing the code-point-to-glyph mapping (is it? I don't know enough about fonts to answer that), implementing it may be simpler, or result in smaller files, if one makes the embedded font dense in code points (I tried finding an answer, but soon remembered how complex fonts are, and gave up)
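A toy sketch of why this mapping matters for extraction (the glyph IDs and mapping here are invented, not from any real font): a subset font's content stream stores glyph IDs, and only an optional /ToUnicode CMap maps them back to characters. Without it, an extractor has nothing but opaque IDs.

```python
# Hypothetical subset font: glyph IDs assigned in order of first use.
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}   # the /ToUnicode CMap
glyph_run = [1, 2, 3, 3, 4]                     # what the content stream stores

# With the CMap present, extraction recovers the text...
extracted = "".join(to_unicode[g] for g in glyph_run)
print(extracted)  # Hello

# ...without it, all an extractor can emit is placeholder characters.
garbage = "".join(chr(0xE000 + g) for g in glyph_run)  # private-use code points
```

This is also why PDF/A (mentioned elsewhere in the thread) requires text to be mappable back to Unicode: it effectively mandates that this reverse mapping exists.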
And of course, modern tools _should_ output accessible PDF documents, which means text extraction _should_ work. I wouldn’t know how well that works in reality, but have my doubts.
It's also easy to create a PDF that is hard to extract text from, not through a deliberate attempt to enforce copy protection, but often simply through attempts to shrink the file, since you may not want to store the entirety of a font in a document.
I've been on both ends of this, generating documents and consuming them, and I think we could probably have created something that allowed for much easier text extraction, but it's far too late now.
Is this sarcasm?
AFAIK, PDF is deliberately designed to not give a F about semantics. There is no way to determine what is part of what in a PDF document. All you get is association by adjacency.
Hasn't it always been that way? Has something changed?
Also, you haven't addressed another huge failure of the most basic digital workflow - copy & paste - by pointing at the author's motivation, since the "except spaces" issue ruins it for everyone, not just professional data extractors.
PDF is a remarkable creation. It has some notable weaknesses, such as the fact that its color channel for images does not include alpha, and thus needs masks, but the fact that it covers so much visual complexity in a relatively compact form is just amazing. (BTW: Its graphics model is taken straight from Adobe PostScript, but PDF content streams are not programs.)
One thing that bugged me while reading this article was the use of the definite article ("the PDF"). Since PDF is an acronym for "Portable Document Format" there may be a grammatical case to be made for the "the", but no one says "the HTML" or "the NASA" and so on.
Good luck saving an HTML version of any modern web page and being able to read it in twenty or thirty years' time. HTML just wasn't designed for that.
While it's possible to adapt the design with @media print, including the page breaks, few websites do this. You are often left with broken layouts, empty pages, or nonsensical page breaks.
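For reference, the kind of print stylesheet few sites bother with looks something like this (the selectors here are hypothetical; every site's markup differs):

```css
@media print {
  /* hide navigation chrome that makes no sense on paper */
  nav, .sidebar, .cookie-banner { display: none; }

  /* avoid splitting a figure or heading across pages */
  figure, h2 { break-inside: avoid; }
  h2 { break-after: avoid; }
}
```

Even this much only controls page breaks coarsely; there's no way to guarantee the kind of fixed layout a PDF gives you.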
I tried to make a conference poster with SVG - using Inkscape - and it was a minor disaster: it rendered differently in different programs/browsers, with some features entirely broken.
but I don't know of a third option..
IIRC it came out around the same time as the initial Acrobat format, but not necessarily in response to it. Eventually there were viewers for Windows and OS/2. It wasn't particularly bad, but it was very literal in display, and Acrobat/PDF rapidly left it in the dust.
When the web boomed in 1995–1996, the product group behind BookManager tried to ban distribution of PDFs by other IBM groups, but failed. One of the problems with BookManager-formatted files is that you had to recreate the appropriate record format if you transferred them back to a mainframe, and I vaguely recall EBCDIC vs ASCII issues (where PDF is, I think, UTF native?).
They do say "please don't" as far as their "LC preference" [1], but later in the document they have nice things to say about the format being just .zip and .xml, so its introspection and recovery options are much larger than "welp, hope pdf2text still exists in 2040".
1: they have a Recommended Formats Statement: https://www.loc.gov/preservation/resources/rfs/ which is currently published in HTML and PDF with a "Get Adobe Reader" button on the page, which I feel is dangerously misguided advice
That comes with downsides, yes, but at its core it's just working fine.
edit: Third option would be to render your content as an image, but that comes with its own downsides.
So to me it kinda looks like the format is lacking.
I also don't know much about it, but I assume it's not easy to generate programmatically, while generating an SVG diagram/image, for instance, is generally pretty trivial.
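As a sketch of that triviality: SVG is just XML text, so a minimal bar chart can be emitted with plain string formatting, no libraries required (the dimensions and values here are arbitrary):

```python
# Emit a tiny three-bar SVG chart as a plain string.
bars = [30, 80, 55]  # arbitrary bar heights in pixels
rects = "".join(
    f'<rect x="{i * 40}" y="{100 - h}" width="30" height="{h}" fill="steelblue"/>'
    for i, h in enumerate(bars)
)
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="120" height="100">'
    f"{rects}</svg>"
)
print(svg)  # paste into any browser to render
```

Doing the same with a page description language that positions individual glyph runs and requires cross-reference tables is a very different undertaking.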
https://en.wikipedia.org/wiki/Device_independent_file_format
It just works.
As long as you're using the one true program: Adobe Acrobat.
A few years ago I made the mistake of printing a PDF shipping label from a web browser ... which left a critical barcode blank.
In the graphics industry we mainly use PDF/X files. These are very solid and precise in defining the layout and how objects are rendered.
For archiving purposes there's another standard, it's called PDF/A. Part of PDF/A is that you must be able to transform its text content back to Unicode.
So, if you're looking to convert PDFs back and forth, you should probably use PDF/A. PDF/X files will drop that support in order to preserve the intended appearance as closely as possible.
> Comments on places like HackerNews refer to it as “one of the worst file formats ever produced” [1], “soul-crushing” [2], and something that “should really be destroyed with fire” [3].
I found the source for [1] and [3] https://news.ycombinator.com/item?id=22474460 but couldn’t identify the source of [2].
The most 'complicated' software I used was Volkswriter!
You can't find a free tool that offers features close to Adobe Acrobat; there is none. You have to download multiple tools that each offer their own feature close to Acrobat's.