But towards the end, you start to see the real objection to PDFs: that it's not always easy to extract text automatically from a document. It mentions a few of the issues - extra spaces, missing spaces, hidden off-page text that gets extracted anyway, fonts designed to obfuscate the internal text (e.g. by re-arranging characters or splitting glyphs up in strange ways), etc. It's worth noting that, with the exception of the spaces, these techniques are used deliberately to stop people extracting the text, or the copyrighted fonts, from the document.
It's not at all obvious from the document itself, but if you click on the link to the company, all becomes clear. The reason this company is saying that all these things are problems with PDF is that they are in the business of extracting raw text from PDF. All the designers' efforts to place things in specific places, to make the result pleasing for a human to read, etc.? They don't want any of that. They just want to extract the raw text so they can data-mine it and sell that as a service.
The PDF “specification” is not a specification; it only documents the happy path. It never states that the behavior of Acrobat remains the holy truth, but in practice undocumented bug-for-bug compatibility is assumed. (We're talking about the most basic, universally supported features here.) If ISO were worth their salt, they would at least try to codify the de facto behavior instead of stamping their name on an Adobe-provided document; then it would be a horrible but fixed format. A collection of conformance tests would be nice to have, too.
Of course, this “history” is just a promotional leaflet describing the “layman approach” they tried to construct. It's a fault not to mention that PDF was, and still is, a foundation of the digital print industry, where big vendors solve compatibility problems for mere mortals and thereby create the unwritten rules of what should and shouldn't work.
It is also ironic that they praise the Web, but have to use Web Archive to link to the article from the ancient year of… 2020.
The reason many PDFs produce garbage when extracting text is that the underlying document doesn't include fonts; every letter is a drawn glyph. This is most common in older (1990s) PDFs generated on UNIX systems.
Since the page description language just says "write text foo", and that text is broken up by the generating software, there is not necessarily a whole line of text as a human would see it.
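To illustrate, here's roughly what a PDF content stream looks like (the font name, coordinates, and split points are made up; real generators fragment text wherever kerning or layout demands it):

```
BT                          % begin text object
/F1 11 Tf                   % select font F1 at 11pt
72 700 Td                   % position the text
(The reason that man) Tj    % show a fragment...
96.4 0 Td                   % ...move over...
(y PDFs produce garb) Tj    % ...show the next fragment
ET                          % end text object
```

An extractor sees only these positioned fragments; it has to guess from the coordinates that "man" and "y" belong to the same word.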
And some PDFs are impossible to extract text from because the page has been flattened into an image. Law firms are notorious for doing this: it lets them provide documents exactly as required/specified during discovery while making the text impossible to extract. Basically it's a fax - every page is a TIFF image (because it is harder to OCR than a JPEG, although JBIG2 has its own flaws [0]).
I've been working with PDF projects off and on since the late 90s. The standard tries to be everything for everybody and that makes it a Charlie Foxtrot from top to bottom (if you've ever written an object viewer/editor to dig inside PDFs, you know exactly what I'm referring to). It is a great spec for making a document that appears the same no matter where it is viewed or printed. But I always treat it as a sausage: you can turn the cow into a sausage, but you can't turn that sausage back into a cow.
0 - https://www.theverge.com/2013/8/6/4594482/xerox-copiers-rand...
On GNU/Linux/BSD you have OCRmyPDF to do that.
Maybe (and, for the fonts, likely), but I don't think it's the only reason. Subsetting embedded fonts makes PDFs smaller, often a lot smaller (why embed an entire font when the document uses only a single glyph of it as a bullet point? Why include Chinese, Japanese, etc. glyphs if the document doesn't use them?)
Even if it's possible to do that without changing the code-point-to-glyph mapping (is it? I don't know enough about fonts to answer that), implementing it may be simpler, or result in smaller files, if one makes the embedded font dense in code points (I tried finding an answer, but soon remembered how complex fonts are, and gave up)
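A toy sketch of why this mapping matters for extraction (the glyph IDs and mapping here are invented, not from any real font): a subset font's content stream stores glyph IDs, and only an optional /ToUnicode CMap maps them back to characters. Without it, an extractor has nothing but opaque IDs.

```python
# Hypothetical subset font: glyph IDs assigned in order of first use.
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}   # the /ToUnicode CMap
glyph_run = [1, 2, 3, 3, 4]                     # what the content stream stores

# With the CMap present, extraction recovers the text...
extracted = "".join(to_unicode[g] for g in glyph_run)
print(extracted)  # Hello

# ...without it, all an extractor can emit is placeholder characters.
garbage = "".join(chr(0xE000 + g) for g in glyph_run)  # private-use code points
```

This is also why PDF/A (mentioned elsewhere in the thread) requires text to be mappable back to Unicode: it effectively mandates that this reverse mapping exists.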
And of course, modern tools _should_ output accessible PDF documents, which means text extraction _should_ work. I wouldn’t know how well that works in reality, but have my doubts.
It's also easy to create a PDF that is hard to extract text from, not through a deliberate attempt to enforce copy protection, but often simply through attempts to shrink the file, since you may not want to store the entirety of a font in a document.
I've been on both ends of this, generating documents and consuming them, and I think we could probably have created something that allowed for much easier text extraction, but it's far too late now.
Is this sarcasm?
AFAIK, PDF is deliberately designed to not give a F about semantics. There is no way to determine what is part of what in a PDF document. All you get is association by adjacency.
Hasn't it always been that way? Has something changed?
Also, you haven't addressed another huge failure of the most basic digital workflow - copy & paste - by pointing at the author's motivation, since the "except spaces" issue ruins it for everyone, not just professional data extractors.
PDF is a remarkable creation. It has some notable weaknesses, such as the fact that its color channel for images does not include alpha, and thus needs masks, but the fact that it covers so much visual complexity in a relatively compact form is just amazing. (BTW: Its graphics model is taken straight from Adobe PostScript, but PDF content streams are not programs.)
One thing that bugged me while reading this article was the use of the definite article ("the PDF"). Since PDF is an acronym for "Portable Document Format" there may be a grammatical case to be made for the "the", but no one says "the HTML" or "the NASA" and so on.
Good luck saving an HTML version of any modern web page and being able to read it in twenty or thirty years' time. HTML just wasn't designed for that.
While it's possible to adapt the design with @media print, including the page breaks, few websites do this. You are often left with broken layouts, empty pages, or nonsensical page breaks.
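For reference, the kind of print stylesheet few sites bother with looks something like this (the selectors here are hypothetical; every site's markup differs):

```css
@media print {
  /* hide navigation chrome that makes no sense on paper */
  nav, .sidebar, .cookie-banner { display: none; }

  /* avoid splitting a figure or heading across pages */
  figure, h2 { break-inside: avoid; }
  h2 { break-after: avoid; }
}
```

Even this much only controls page breaks coarsely; there's no way to guarantee the kind of fixed layout a PDF gives you.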
I tried to make a conference poster with SVG - using Inkscape - and it was a minor disaster: it rendered differently in different programs/browsers, with some features entirely broken.
but I don't know of a third option..
IIRC it came out around the same time as the initial Acrobat format, but not necessarily in response to it. Eventually there were viewers for Windows and OS/2. It wasn't particularly bad, but it was very literal in display, and Acrobat/PDF rapidly left it in the dust.
When the web boomed in 1995–1996, the product group behind BookManager tried to ban distribution of PDFs by other IBM groups, but failed. One of the problems with BookManager-formatted files is that you had to recreate the appropriate record format if you transferred them back to a mainframe, and I vaguely recall EBCDIC vs ASCII issues (where PDF is, I think, UTF native?).
They do say "please don't" as far as their "LC preference" [1], but later in the document they have nice things to say about the format being just .zip and .xml, so its introspection and recovery options are much larger than "welp, hope pdf2text still exists in 2040".
1: they have a Recommended Formats Statement: https://www.loc.gov/preservation/resources/rfs/ which is currently published in HTML and PDF with a "Get Adobe Reader" button on the page, which I feel is dangerously misguided advice
That comes with downsides, yes, but at its core it's just working fine.
edit: Third option would be to render your content as an image, but that comes with its own downsides.
So to me it kinda looks like the format is lacking.
I also don't know much about it, but I assume it's not easy to generate programmatically, while generating an SVG diagram/image, for instance, is generally pretty trivial.
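As a sketch of that triviality: SVG is just XML text, so a minimal bar chart can be emitted with plain string formatting, no libraries required (the dimensions and values here are arbitrary):

```python
# Emit a tiny three-bar SVG chart as a plain string.
bars = [30, 80, 55]  # arbitrary bar heights in pixels
rects = "".join(
    f'<rect x="{i * 40}" y="{100 - h}" width="30" height="{h}" fill="steelblue"/>'
    for i, h in enumerate(bars)
)
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="120" height="100">'
    f"{rects}</svg>"
)
print(svg)  # paste into any browser to render
```

Doing the same with a page description language that positions individual glyph runs and requires cross-reference tables is a very different undertaking.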
https://en.wikipedia.org/wiki/Device_independent_file_format
It just works.
As long as you're using the one true program: Adobe Acrobat.
A few years ago I made the mistake of printing a PDF shipping label from a web browser ... which left a critical barcode blank.
In the graphics industry we mainly use PDF/X files. These are very solid and precise in defining the layout and how objects are rendered.
For archiving purposes there's another standard, it's called PDF/A. Part of PDF/A is that you must be able to transform its text content back to Unicode.
So, if you're looking to convert PDFs back and forth, you should probably use PDF/A. PDF/X files will drop that support in order to preserve the intended appearance as closely as possible.
> Comments on places like HackerNews refer to it as “one of the worst file formats ever produced” [1], “soul-crushing” [2], and something that “should really be destroyed with fire” [3].
I found the source for [1] and [3] https://news.ycombinator.com/item?id=22474460 but couldn’t identify the source of [2].
The most 'complicated' software I used was Volkswriter!
You can't find a free tool that offers features close to Adobe Acrobat; there is none. You have to download multiple tools that each offer their own feature close to Acrobat's.