Preview in macOS Big Sur is destroying PDFs

(annoying.technology)

359 pointsmatrixagent5y ago317 comments

I have learned to be scared of my MacBook. Seemingly safe behavior can cause permanent damage. It does completely unexpected things, apparently by design.

I do not put my pictures in the ~/Pictures directory for fear of what the newest app will do to “improve” them for me. I fully expect it to apply lossy compression to my files without asking. This is after Photos or whatever it was called at the time mangled the dates on a bunch of my vacation photos to 10 years before the actual trip.

Oh and have fun when your photos are automatically uploaded to iCloud to save space locally then silently deleted from iCloud to... save space? My sister lost her first year of baby pictures to that one.

Same with ~/Music after iTunes wiped out a bunch of carefully curated metadata. Yes, I did want that album art.

I fat-fingered some key combination in Messages recently and got a prompt confirming I wanted to delete the entire conversation history. I consider myself lucky it bothered to ask.

I can add “view a PDF” to the list of things likely to leave me holding the bag.

cle5y ago

I have run into that Messages fat-finger-delete multiple times, it is infuriating! I still don't know what the key combination is, but IIRC the confirmation defaults to "Delete" when enter or space are pressed, which are...quite common when sending short messages.

tekacs5y ago

They've finally removed the keyboard shortcut for this in Big Sur. :)

It was Cmd + Delete/Backspace before.

mulmen5y ago

After my latest iCloud password change Messages has also been giving me the beachball of doom when images are received. I'm terrified of what that implies. I'm looking forward to the announcement of the exploit where a carefully crafted image owns MacBooks.

FireBeyond5y ago

Multiple people are complaining that Big Sur is blowing out their speakers on the laptops.

I submitted something yesterday where Big Sur completely breaks DSC for all non-Apple monitors (and in some cases, even those).

Oof.

p1necone5y ago

The more I learn about macs the more I think the "it just works" crowd really mean "I will sacrifice my system 'working' sometimes in exchange for having zero configuration options".

iansinnott5y ago

In the past it really did just work and macs were configurable. The trend of limited configuration is more recent (and yes, it's terrible).

netflixandkill5y ago

For a decade or so it pretty much did just work. Alas, nothing gold (or metallic shades of white in this case) can stay.

norswap5y ago

Apple, who sold hardware because they had the best software, now sells software because they have the best hardware.

Funny how things change.

dkonofalski5y ago

Those things are literally not possible to happen without your intervention... ಠ_ಠ

kalleboo5y ago

Yeah, macOS does not touch files in Pictures or Music, only files that you've explicitly imported into the Photos or Music/iTunes apps. And it definitely doesn't silently delete photos from iCloud, if that was common it would be a major bug that would make the news.

robertoandred5y ago

Do you have proof of iCloud deleting photos?

meibo5y ago

I assume these are a result of bad UX, at least in my personal experience.

iPhones used to (and maybe still do; I haven't had the pleasure in a year now) bother you quite heavily to either upgrade or clean out your storage if you're at your iCloud storage cap. It's not a stretch that some users might not think long enough about the consequences.

Yetanfou5y ago

Install 'Linux' [1] on the machine then? That is, assuming you're using a model which is supported by some form of Linux. That way you get to use the hardware without being bitten by the software. Linux distributions are not perfect either but they offer fewer such 'surprises'. Keep MacOS around for those times you need to run software which is only supported there but do your main work in Linux.

[1] where 'Linux' stands for any supported Linux distribution

xxpor5y ago

Why anyone treats PDF as anything but a write-once format is beyond me. It's so finicky that I'm not shocked bugs like this happen. The only programs I'd be reasonably sure wouldn't screw it up are Acrobat itself, and pdflatex and friends.

I think we need a multi-image container format. It could be something that's literally a bunch of jpgs/pngs/pick your poison in a tar container, and given a new extension. OSes would open it and present it as a gallery in order. There's no value in a non-ocr'd PDF existing. For OCR'd text that gets more complicated, but it feels like we should be able to come up with a common denominator that doesn't have the legacy of a binary format derived from postscript in the early 90s.

systemvoltage5y ago

PDF is one of the best things IMO - it's like a docker container for documents. The way the original authors intended it to be, including fonts and all the things that go into making a document.

For example, I personally like to purchase books that are in PDF format, not epub/mobi. I want to rely on professional typesetting from the publishers, not some front-end engineer's vision of what the ebook should look like and how it should be presented to the user. Epub/mobi works only for novels and long form reading where typesetting is not critical. Basically any book that also has a good audiobook version works fine with epub/mobi since visual formatting is a non-issue. For everything else, from programming books to research papers, PDF is great. Can the PDF format be improved? Sure, but the level of adoption and its widespread use are more important than fixing the copy-paste and content-migration aspects of PDFs.

What I absolutely DO NOT WANT - is web page like format with auto-flowing text and something that fits to the screen with user styling/typesetting. That IMO defeats the purpose of what the container is supposed to do, i.e. contain and not leak. It should most definitely have fixed physical dimensions (or pixels).

wtallis5y ago

> PDF is one of the best things IMO - it's like a docker container for documents. The way the original authors intended it to be, including fonts and all the things that go into making a document.

Sounds like you want the PDF/A standard for archiving: https://en.wikipedia.org/wiki/PDF/A

It forbids embedding audio and video and JavaScript, requires embedding all the fonts, forbids encryption and patent-encumbered compression. It's basically the PDF format with the most regrettable features stripped out.

inetknght5y ago

> I want to rely on professional typesetting from the publishers, not some front-end engineer's vision of what the ebook should look like and how it should be presented to the user.

I, on the other hand, do not whatsoever trust publishers. PDF runs software and that means that everything under the sun including malware and DRM can run on PDF -- and indeed that has occurred many times. It should be a non-starter for anyone who actually values the ability to read their content on any device they want.

markandrewj5y ago

Generally the PDF copy of a book is designed to fit the format of the final printed book. The PDF is even often the deliverable sent to press.

If you are using something like an e-reader, or a smartphone, the PDF layout often doesn't translate well. Typesetting is also normally also done for ePub/Mobi, but the layout is tailored to the format of the device that the document will be read on. Although there are times when publishers just take the PDF and click 'convert to ePub', which isn't ideal.

There are also other advantages to different formats, when talking about things like programming books. As an example, working code snippets for web formats. I am thinking of things like Jupyter notebook when I say this.

As others have mentioned there are also a number of security risks associated with PDF.

I can't deny there are several books I have read in PDF format however.

odyssey75y ago

Knuth as a book author has written about this on his website.

“Warning: Unfortunately, however, non-PDF versions have also appeared, against my recommendations, and those versions are frankly quite awful. A great deal of expertise and care is necessary to do the job right. If you have been misled into purchasing one of these inferior versions (for example, a Kindle edition), the publishers have told me that they will replace your copy with the PDF edition that I have personally approved. Do not purchase eTAOCP in Kindle format if you expect the mathematics to make sense. (The ePUB format may be just as bad; I really don't want to know, and I am really sorry that it was released.) Please do not tell me about errors that you find in a non-PDF eBook; such mistakes should be reported directly to the publisher.”

https://www-cs-faculty.stanford.edu/~knuth/taocp.html

ernst_klim5y ago

> I personally like to purchase books that are in PDF format, not epub/mobi

PDF is abysmal for books though. 1) You can't scale fonts to fit various screen sizes. 2) It's waaay more expensive to interpret and render, so it affects battery life.

jcelerier5y ago

> The way the original authors intended it to be, including fonts and all the things that go into making a document.

cries in random PDFs that end up being printed with letters spaced by 1 cm

cratermoon5y ago

What size and resolution is your eReader? I can read PDFs on my computers fine, the screen is large enough. On my eReader I can't both see a whole page and have the text large enough to read, even with my reading glasses. I end up zooming in to bring the text up to readable size and then I'm only seeing part of the page and have to scroll left/right/up/down to read a page.

adrianmonk5y ago

> absolutely DO NOT WANT - is web page like format with auto-flowing text

I would love it, though, if PDF included this as something that's always entirely optional.

Sometimes I want to read something just as formatted. If my display is big enough, and the formatting has any importance, I probably want that.

Other times, my screen is smaller (phone) or the wrong shape (small laptop), and I'd rather the text conform to the device rather than vice versa.

Also, sometimes I use my arrow keys to scroll as I read to keep my place (like a line-oriented instead of page-oriented bookmark). So just because the device is capable of it, I don't necessarily always want page-oriented original formatting because it might have two-column text to deal with or top/bottom margins that serve no purpose for me.

higerordermap5y ago

> The way the original authors intended it to be, including fonts and all the things that go into making a document.

So that it's very very hard to read on mobile, or in a small width window.

> For example, I personally like to purchase books that are in PDF format, not epub/mobi

exactly the opposite. I want to be able to properly reflow it in a narrow window side by side with another tab on laptop, often for technological books.

ketamine__5y ago

What device do you use to read PDF's?

pklausler5y ago

Would you be happy with some kind of paginated image file format, if it were not significantly larger than PDF?

fomine35y ago

On the contrary, what I absolutely WANT is a web-page-like format with auto-flowing text, something that fits to the screen with user styling/typesetting. I wish both options were available.

hcurtiss5y ago

I agree completely. Have you found any good resources for acquiring books in pdf?

izacus5y ago

> Why anyone treats PDF as anything but a write-once format is beyond me.

sigh These kinds of ignorant statements really annoy me.

The massive part of PDF spec is dedicated to editable features (annotations, form filling and digital signatures) which are used by massive industries daily because it brings them a lot of value. It's REALLY not that hard to think for a second and remember why having a single file which can be:

- Edited, annotated and commented

- Approved with explicit markers

- Digitally signed by auditors and reviewers with tamper protection

- Rendered on anything that has a semblance of interactive UI with graceful degradation

- Archived on any kind of file storage

is insanely useful and valuable for massive worldwide industries.

TwoBit5y ago

And yet PDF re-writing keeps getting screwed up. Probably because it really is a bad format for this, the pages of docs notwithstanding.

theamk5y ago

Sure, but the major text is still "write once" -- you can only add specially designated information to a read-only core.

As this article shows, a trivial operation on the core text, like "remove a blank page", is still very hard and pretty easy to get wrong.

I think having general Turing-complete power in a document format is a horrible idea. It's a pity we ended up with it. Something like OpenOffice's presentations ("odp") is also a single file which is laid out per page, and can be annotated, commented, approved, edited and so on, while not being Turing-complete.

Igelau5y ago

le sigh the surface changes you're talking about are small potatoes compared to removing a page and altering document structure without breaking anything.

bigbubba5y ago

In many industries, writable PDFs are commonplace and aren't seen to be a problem because people in those industries don't see any problem with using Acrobat. If you want to unseat Acrobat and PDF, I think you'll have to provide something that has equivalent or greater power.

As for multi-image container formats, I think we already have that: cbz/cbr. Just a zip or rar archive with images in it, to be displayed in order by name. This 'format' is in common use already for scans of comic books. There are numerous viewers you can use to open and browse these files. Something to consider though; accessibility. A screen reader needs to use OCR to read these files to people with impaired vision. PDF files aren't fantastic for screen readers either, but they're much better than just a JPEG. I'd love to see some sort of subtitle system for images (I think mpv could probably overlay a subtitle sidecar on a still image, but that's not widely supported. Text-based subtitles formats are easy to wire into text to speech though.)

gfody5y ago

you can also embed images and fonts in svg, probably not as reliable as pdf for pixel-perfect reproductions everywhere though

monocasa5y ago

The comic book piracy community in fact does have what you're asking for. The formats .cbz, .cbr, etc. are just zip and rar files respectively, with images inside and a standard format for the internal filenames so they're presented in order by your reader.

Would love something like that with svg allowed inside to support vector drawings.
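
For anyone curious how little there is to the format, here is a minimal sketch in Python, assuming nothing more than "a zip of images, shown sorted by file name". The function names and the placeholder page bytes are made up for illustration; real readers may apply their own sorting rules.

```python
import io
import zipfile


def build_cbz(pages: dict[str, bytes]) -> bytes:
    """Pack page images into an in-memory .cbz, which is just a zip archive."""
    buf = io.BytesIO()
    # JPEG/PNG data is already compressed, so store it without recompression.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in pages.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_pages_in_order(cbz_bytes: bytes) -> list[str]:
    """Return page file names sorted by name, the order a viewer would use."""
    with zipfile.ZipFile(io.BytesIO(cbz_bytes)) as zf:
        return sorted(zf.namelist())


# Placeholder bytes stand in for real image data (hypothetical file names).
demo_pages = {
    "page_002.jpg": b"second",
    "page_010.jpg": b"tenth",
    "page_001.jpg": b"first",
}
demo_archive = build_cbz(demo_pages)
print(read_pages_in_order(demo_archive))
```

Zero-padding the page numbers matters here: with a plain string sort, "page_10.jpg" would land before "page_2.jpg".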

ASalazarMX5y ago

And this has become the de facto open standard, supported by many document viewers, probably because it's a very straightforward combination of two widely used tools.

yarcob5y ago

Basic PDF editing (adding, removing, reordering, cropping and rotating) has been rock solid in Apple's Preview app. It's something that I dearly miss when I'm on Windows.

Which makes showstopper bugs like this very unfortunate.

But then again, it's been common knowledge that you shouldn't upgrade to a .0 release...

helmholtz5y ago

I'm not sure how widely you've sampled the field on Windows, but in case you don't know it, I can recommend PDF-Xchange Editor. They forced it upon us at my workplace and it seemed dodgy as hell at first. But I've been using it steadily and now I really like it. I'm a paper-free researcher so I no longer print research papers. Instead, I annotate/highlight/comment using this software, and it's worked so well that I've installed it on my personal machine as well.

I've tried all of the commands you've mentioned, and so far it's worked without a hitch.

crazygringo5y ago

Because you often have no choice.

You have a PDF of a book and you need to export the pages of a single chapter to a new PDF.

Or you have 30 different PDF's that you need to combine into a single one.

Or a full-color large-filesize scanned PDF that you need to convert to a smaller-sized black-and-white one.

Or you need to copy quotes from a PDF to paste somewhere else.

I could go on... PDF's are documents, and normal workflows involve all sorts of conversion and rearrangement and processing of documents. That's the whole point.

_-david-_5y ago

None of those require modifying the existing document. A new document could be produced and the existing document(s) would only be read from.

richard_todd5y ago

There are lots of image-container formats. DjVu is my favorite, and supports OCR text annotations also. But, outside of some niches, it has fallen out of use these days. Adobe closed the quality gap by adding jbig2 and jpeg2000 support to PDF, making it possible to build a "bunch of images" PDF with similar quality to DjVu.

Yetanfou5y ago

Strangely enough (?) I have not had many problems with PDF, certainly no more than with other document formats. I often use tools like pdftk [0] (as found on many Linux distributions) to split sections out of PDFs, create new ones out of single-page PDFs created with ImageMagick, create odd-even page versions etc. I generally do not touch the more "advanced features" like embedded JS (which I have disabled in all readers which support it), I just use it as a document format which more or less guarantees the resulting document looks the way it was meant to, plus or minus a few fonts. For that purpose it works well enough.

The "bunch of jpgs/pngs/pick your poison in a tar container" format you describe exists in a fashion: Comic Book Archive [1], a format meant for and mostly used for comics. It consists of a compressed archive which contains sequentially numbered "pages" which can be JPEG, PNG or other image file formats. For pure image documents it can be used as a replacement for PDF, but since it does not support text it cannot be used for scanned OCR'ed documents. DjVu [2] does support a text layer but that comes at the cost of complexity; it is far from the simple container you propose. Since an OCR'ed text layer needs to save not only the text itself but also the location on the image for each character, I don't see any way to avoid complexity here.

[0] https://www.pdflabs.com/tools/pdftk-server/

[1] https://en.wikipedia.org/wiki/Comic_book_archive

[2] https://en.wikipedia.org/wiki/DjVu

sildur5y ago

Blind and shortsighted people may not like your idea. There are accessibility options in PDF, but even a regular one can be accessed with a screen reader. Your images won't.

bigbubba5y ago

It's also generally a pain in the ass for people with good vision too, since text can't be copied out of a JPEG. If you want to forward a paragraph of text from a pure raster document, you have to read and type it all out yourself.

theamk5y ago

It's not like PDF is the only one with text layer, or the only options are PDF, JPG or PNG.

DjVu can have text layers as well, and I think SVG too?

qwerty4561275y ago

> Why anyone treats PDF as anything but a write-once format is beyond me.

Because PDF is supposed to be like electronic paper and you normally can draw on a paper original.

In an ideal world OSes and apps would be better equipped to separate data and metadata and work with multi-stream files. Every file would have different kinds of pure data (e.g. scanned picture and OCRed text), metadata (e.g. title, author, ISBN/etc IDs, ToC etc) and user-generated annotations in separate streams. But our actual world is not ideal, we mix everything into one stream for every document and allow the apps modify it every time we view it.

eddieh5y ago

We have that container format. It is called: TIFF

ruined5y ago

There already exist formats as you describe, but regardless, a change of format wouldn't fix anything. The same problem would just surface there, as Preview and whatever add support for those formats, and include editing features.

If you don't want to break something, just don't try to edit it. That can't be enforced anywhere but at the individual user level.

Even proprietary formats in first-party software have this problem. Hell, plain-text editors have this problem.

pm2155y ago

It would help if Preview wasn't really keen on editing the document, though. I find it tries to modify PDFs quite often -- I think it happens if I accidentally click on a table as I scroll through the document. I have resorted to 'chmod 0444 *.pdf' on my folder of datasheets and manuals so it can't mess with what I want to be a read-only file. Preview is the only pdf viewer I've ever needed to do that for.

I've never wanted to edit a PDF, so personally I'd rather the feature was gated behind a menu option or something so you had to deliberately ask for it.
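
The same read-only trick can be scripted. This is only a sketch, assuming a POSIX-style filesystem and a single flat folder of PDFs; the function name and the demo file names are invented for the example.

```python
import stat
import tempfile
from pathlib import Path


def make_pdfs_read_only(folder: Path) -> list[Path]:
    """Set mode 0444 on every *.pdf directly inside folder (not recursive)."""
    changed = []
    for pdf in sorted(folder.glob("*.pdf")):
        # Read permission for owner, group and others; no write bits at all.
        pdf.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
        changed.append(pdf)
    return changed


# Demo in a throwaway directory so no real documents are touched.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "datasheet.pdf").write_bytes(b"%PDF-1.4\n")
    (folder / "notes.txt").write_text("not a pdf")
    for path in make_pdfs_read_only(folder):
        print(path.name, oct(stat.S_IMODE(path.stat().st_mode)))
```

Only files ending in .pdf are touched, so sidecar notes in the same folder stay writable.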

jrochkind15y ago

Making a PDF include OCR'd text seems to by definition require writing twice. It has to be edited to add the OCR'd text that was not in the original image-only PDF, no?

I mean, you can say that the second PDF is a different PDF that has only been written once, but you can say that about literally any edit, "write once" no longer means anything if you think about it that way.

aardvark1795y ago

You can do the OCRing as a part of the original scanning process. You define a blank font and place the letters at the appropriate points to allow text selection. This is also done for documents that use fonts which are not licensed for embedding. It reduces on-screen rendering quality slightly, as you have lost any hinting information, but it’s fine for printing and preserves the intent.

tonyedgecombe5y ago

PDF does support appending to the file and it’s not just for adding pages. You can make changes throughout the original document.

I’m not sure that is what Preview does though.
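
A rough way to tell whether a tool appended to a file or rewrote it, sketched in Python. The assumptions are loud ones: each saved revision is taken to end with its own `%%EOF` marker, and the byte strings below are PDF-shaped placeholders rather than valid documents. The function names are made up, and the marker count is only a heuristic (linearized files, for instance, can carry an extra marker).

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; an appended update normally adds one more."""
    return pdf_bytes.count(b"%%EOF")


def is_append_only_revision(old: bytes, new: bytes) -> bool:
    """True if new is old with extra bytes appended, i.e. old is left intact."""
    return len(new) > len(old) and new.startswith(old)


# PDF-shaped placeholder bytes, NOT valid PDFs (illustration only).
original = b"%PDF-1.4\n...objects...\nstartxref\n9\n%%EOF\n"
appended = original + b"...new objects...\nstartxref\n50\n%%EOF\n"
rewritten = b"%PDF-1.4\n...different objects...\nstartxref\n9\n%%EOF\n"

print(count_eof_markers(original), count_eof_markers(appended))
print(is_append_only_revision(original, appended))
print(is_append_only_revision(original, rewritten))
```

Comparing a backup against the saved copy this way would at least show whether the old bytes survived the save.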

idle_zealot5y ago

> I think we need a multi-image container format. It could be something that's literally a bunch of jpgs/pngs/pick your poison in a tar container

You've just described the .cbt format!

https://en.m.wikipedia.org/wiki/Comic_book_archive

marcan_425y ago

And it's a bad format, because .tar does not have an index that allows random access. It also does not have compression, unless you stick it in gzip, which makes it even worse because then you can't even skip entries, you have to decompress the whole thing from the top.

.zip is better, at least that has an index, but at the end of the file, so it's not linearly streamable.

For what it's worth, .pdf is structurally ~the same as .zip; the index is at the end, and blocks are compressed individually.

Choice of archive format matters :-)
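
To make the "index is at the end" point concrete for PDF, here is a small Python sketch that reads the pointer to the cross-reference section from the tail of a file. It assumes the conventional `startxref` / offset / `%%EOF` trailer layout; the function name, the 1024-byte tail size and the sample bytes are all illustrative, and the sample is PDF-shaped rather than a complete valid PDF.

```python
import re


def find_startxref(pdf_bytes: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset stored after the last 'startxref' keyword."""
    tail = pdf_bytes[-tail_size:]  # the pointer lives near the end of the file
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found near the end of the data")
    return int(matches[-1])


# Build PDF-shaped sample bytes whose trailer points at the 'xref' keyword.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(body)
sample = (
    body
    + b"xref\n0 1\n0000000000 65535 f \n"
    + b"trailer\n<< /Size 1 >>\n"
    + b"startxref\n" + str(xref_offset).encode() + b"\n%%EOF\n"
)

offset = find_startxref(sample)
print(offset, sample[offset:offset + 4])
```

A reader has to seek to the end first, then jump back to the table, which is the same access pattern as a zip central directory and the reason neither format streams linearly.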

nine_k5y ago

PDF is often called "electronic paper", meaning its primary use, typesetting documents on screen and paper alike.

I'd rather view PDF as an electronic clay tablet. You can put any text or image on it, but once it's formed, you'd better not try to alter it, lest you break it.

m4635y ago

What annoys me is that there are some .pdf manuals I want to refer to, and when I view them in Preview and go to quit, it asks if I want to keep my changes or revert them.

WTF? why does viewing a manual - and I only used navigation like page up / page down - cause modifications?

soraminazuki5y ago

You've likely created a text box by double clicking on the document.

ernst_klim5y ago

> Why anyone treats PDF as anything but a write-once format is beyond me.

Hahaha. Quite a lot of people around me think that PDF is a collection of JPEGs glued together (because they mostly see PDF as scanned non-OCRed docs).

thordenmark5y ago

"Why anyone treats PDF as anything but a write-once format is beyond me."

Some of us work in print publishing and being able to edit PDF's is critical.

SoSoRoCoCo5y ago

Here is one good reason: My company uses digital certificate signing on PDF. It is a very common usage. We have yet to migrate to DocuSign.

cmiller15y ago

> The only programs I'd be reasonably sure wouldn't screw it up are Acrobat itself

Oh no, I've seen acrobat screw with it too

db48x5y ago

There are already a handful of multi-image container formats. They're all images inside zipfiles, of course.

Siira5y ago

We already have CBR/CBZ.

christkv5y ago

CBR and CBZ for comics fits this I think?

CamperBob25y ago

.PDF works just fine. Don't blame the shortcomings of crappy software implementations on the format itself.

We need a format that will still be readable 100 years or more from now, and .PDF serves that purpose.

crazygringo5y ago

I work with a ton of PDF's between my Mac and iPad, and it mostly works but there are still just way too many bugs.

It's a lot of little things, like in Catalina where opening up the sidebar for annotations (comments) seemingly randomized their order. (Big Sur, fortunately, fixed it to be page-order again.)

Or how printing a PDF from a website (in Catalina, also seemingly fixed in Big Sur) would look right on the page... but if you copied and pasted the text from the PDF to somewhere else, something like 10% of the glyphs were scrambled ("lik3 thZs"), like some sort of character table corruption.

Or reading a PDF with Books on my iPad, maybe 10% of the time bookmarking a page... doesn't bookmark it. Or removing a bookmark... doesn't remove it. Or a handful of highlights you just made have inexplicably disappeared the next time you open the file.

Or whenever you open the PDF in Books it remembers which page you were on. Except sometimes it doesn't, so you can't really rely on that for saving your place.

Or in Books, if you select some text to copy but accidentally hit the adjacent "select all" in the pop-up menu, and you're dealing with a 400-page PDF, it just locks up and you have to restart it.

Or in Preview if you want to convert a PDF to black-and-white, there's an option for it but your PDF will balloon in filesize to 10x larger or something.

I mean, I could go on and on. It's weird, because Preview is an incredible app, really. But it really is like they build it and then never bother to test if basic workflows reliably work.

inetknght5y ago

> 10% of the glyphs were scrambled ("lik3 thZs"), like some sort of character table corruption.

As I understand it, that's a form of copy protection trying to prevent exactly what you're doing.

crazygringo5y ago

While that type of copy-protection does exist in rare instances, in this case it wasn't -- it was with any normal webpage printed to PDF. It was a straight-up bug, I tried to file it with Chromium in fact [1], but it turned out to be a macOS issue that Big Sur fixed.

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=112849...

Someone5y ago

More likely it’s the effect of font subsetting.

If you want your PDF to render the same everywhere, you have to embed font information (even if that font is something like Arial, which is available almost everywhere, as ‘almost everywhere’ isn’t ‘everywhere’, and because there are zillions of variations on Arial)

You don’t want to embed entire fonts, though, certainly not if a PDF uses only a few characters of a font.

The memory wise cheapest way of embedding a font leaves out the table that maps glyph numbers back to Unicode code points. If you do that, the PDF reader guesses a 1:1 mapping to ascii/Unicode.

Subsetting means you can’t extract all glyphs in a font from a single file, but AFAIK, that’s just a side effect.

I guess this bug drops such a table, or messes up that translation table.

m4635y ago

Try to screenshot with a movie onscreen.

You get a grey rectangle.

drevil-v25y ago

That not remembering the last page you were at in Books drives me crazy.

My iPad would be the perfect device for reading PDF books, especially technical and maths text books, if it weren't for the Chinese torture of not knowing whether it will remember my page the next time I open the book or whether I'll have to scroll through until I recall where I left off.

Jtsummers5y ago

I've been a happy GoodReader user for years. Syncs with a variety of sources, remembers where I was, bookmarks work. Annotations work, but I don't use that much beyond being able to say it works.

avalys5y ago

This is a clickbait, sensationalist headline. “Saving a PDF with Preview in Big Sur can corrupt OCR text added by a third-party program” is more accurate.

matrixagentOP5y ago

That's fair, and I honestly didn't enjoy posting the headline here, but as far as I know I have to use the original title? And the original title is from a personal blog where we talk about annoying things. We're not a professional tech blog, or a bug tracker, or… something other than our own little thing. I chose that headline not to be clickbait, but "sensationalist" is probably true because I'm personally really, really angry about this issue. I wanted to do my usual "scan all my documents for the month" routine 30 minutes before going to bed, and instead it turned into a two hour debugging session. And I can't even use my normal workflow now, possibly until March. I find it completely unacceptable that Apple would break Preview that way again. It's not even the first time. Just thinking about it now gets me going again. That's why the headline sounds like it sounds. I would have no problem at all if it was modified here, and as I said – your assessment is absolutely fair.

jabbany5y ago

Kind of agree the current title is a bit misleading.

I've never used any editing features in Preview (I mean it's called "preview" so...) and reading the title I thought this meant it was mangling files even by opening them which would have been super scary.

As for non-Acrobat software mangling PDFs after editing... Well that's much less surprising. I've even had Acrobat mangle stuff in PDFs after editing...

wil4215y ago

Preview has similar features to Microsoft’s snipping tool. Highlighting, draw basic shapes, draw free hand with a couple colors, and add some annotations. I like it better than Windows most of the time.

matrixagentOP5y ago

Well, technically you at least only have a chance of noticing the error after opening a PDF (again). I suppose that's because after saving the old, correct data is still in memory, but I don't know exactly when the corruption happens – would surprise me if it wasn't upon saving, though.

refulgentis5y ago

In my very humble opinion, it's accurate: "Saving a PDF with Preview in Big Sur [Preview in Big Sur] can corrupt OCR text added by a third-party program [is irreversibly destroying PDFs]"

I understand the (fairly common, in these comments) viewpoint that this is the fault of the "third-party program", but since the PDF is readable up until Preview touches it...I find it hard to come around to the viewpoint the third-party program is relevant. Readable bytes -> Preview -> unreadable bytes is my mental model so far.

Edit: absolutely unacceptable this is downvoted to -4. I've observed for a couple months that participation in Apple-related threads, outside indignation that Apple was involved in the discussion at all, gets down to -5 before getting back to -1 a day or two later. No matter what tone is used, this happens, and it makes the problem even worse in the long run. Been here 10 years, always been a _slight_ problem, but over the last year, it's virtually impossible to participate without continuing to slowly destroy my 11 year old account. Not sure how much longer I can keep trying.

birdyrooster5y ago

Preview isn’t breaking files by reading them. As I understand it, people are saving files with Preview and overwriting their ABBYY-compatible PDF. Just because the last four bytes of a file name are “.pdf” doesn’t mean anything that opens files with that suffix will work.

PDF is not a bitmap, it’s a script like HTML or JS.

People understand browser incompatibility, but somehow this is unconscionable.

dkonofalski5y ago

Maybe this can give you some insight. I downvoted your comment simply because it didn't add anything to the discussion and you made an assumption that has several faults in it.

For one, your "mental model" is off because you assume that the first part of "readable bytes" is accurate. Without actually seeing the PDF in question, you don't know if the "readable bytes" are actually corrupted and Preview is fixing them to make them readable. That would mean that Preview is actually correct in its behavior and the source document is what's flawed.

On the tail end of your mental model, then, is another assumption which is that this results in "unreadable bytes" but that's not accurate either. The PDF that results from a save in Preview may be accurate to the PDF specification and is perfectly readable as a PDF in any PDF application/reader. What's no longer readable is the extra content that was originally in the file that may not have been saved correctly, in-spec, or may have been corrupt to begin with.

A big hint as to what's going on here, now that I've had some time to review this, is that the "corruption" happens consistently - the letter "a" is always replaced by the same "corrupted" character, the letter "b" seems to be consistently replaced with the same character, etc. That points, in my opinion, to a lookup that's no longer correct. What side of that lookup is bad is hard to say without seeing the file in question.

vzqx5y ago

The title is technically accurate, but it's misleading to non-mac users like myself. I assumed the author was using functionality called "Preview" only to view the documents, rather than save them.

There's a big difference between "read-only operation is mangling files" vs. "PDF writer is buggy".

JxLS-cpgbe05y ago

> Not sure how much longer I can keep trying

Keep trying, just with a new account every few months or so. HN has no privacy controls, we must add our own.

birdyrooster5y ago

Did they save the file using Preview? If so that’s on them, they chose to write a pdf using Preview and that comes with all of the pitfalls of pdf compatibility. Does plain old PostScript have this problem?

caminocorner5y ago

I'm not the original author. The use case I have for Preview is to open a PDF, read it, highlight a few things and save the file (with the new highlight overlays). I wouldn't expect that to destroy my underlying OCR (which I also use a 3rd party app for).

If the behavior changes, that's not on me, that's on Preview.

I don't have any issue with this today on my Mac, but I'm glad I didn't upgrade to Big Sur. I almost did.

lilyball5y ago

Yes, they used Preview to modify the PDF.

CapriciousCptl5y ago

Probably most people who’ve done a bit of PDF work know there’s no guarantee of the same output from different editors (or even the same one). So I don’t think it’s Preview’s fault per se, since the problem is endemic to PDF. But I don’t think you can blame the user either. Really, PDFs are just these enormously useful, complex things that are always breaking in unexpected ways, and some people haven’t been bitten by their problems enough yet to cope properly.

lilyball5y ago

I find posts like this completely pointless when they include no details at all. This is just "there's an incompatibility between third-party software and a version of macOS that the third-party software says they don't support yet, so I'm going to publicly criticize Apple".

If you're not going to do the work to figure out what the corruption is, at least include the two PDFs so other people can look at them and see what happened.

dewey5y ago

There’s a list of blog posts about the same problem linked in the article including a radar from 2016 (https://openradar.appspot.com/29786282) on Apple’s bug tracker. It’s not exactly an obscure bug that nobody knows how to reproduce.

lilyball5y ago

A radar from 2016 is not useful, that describes an old bug. Just because the symptoms look like something we’ve seen before doesn’t mean it’s the same underlying issue.

matrixagentOP5y ago

> If you're not going to do the work to figure out what the corruption is …

I'm sorry, but last time I checked neither Apple nor ABBYY pay my salary. I really don't understand these takes. If Apple or ABBYY want my PDFs, they should be able to find my email address rather easily. Your tl;dr version of the post is completely unfair. I publicly criticize Apple because they are breaking something that potentially affects a lot of people who are unlikely to even know about it, and they are doing it for at least the second time now. If you don't think that's worthy of criticism, I don't know what is.

I also love how so many people assume I didn't already talk to support and file radars. I guess you had better luck in the past than me, but I can assure you, these options aren't always as useful as you might think they are.

JumpCrisscross5y ago

> neither Apple nor ABBYY pay my salary

This is a fair bar for conversation, in person or online. One can be more demanding of a public write-up.

ztravis5y ago

My guess is that the output PDF is still valid, but that an embedded (subset) font has had its `ToUnicode` map stripped, so that there's no link between the character codes used in the text elements and the "actual" characters they represent (there are also other ways this corruption could happen, but dropping or mangling the `ToUnicode` map seems like a likely cause).
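To make that hypothesis concrete, here is a minimal Python sketch. It is not Preview's or ABBYY's actual code, and the glyph codes, the table contents and the fallback behavior are all invented for illustration. The idea is that in a subset font the character codes in the page content are arbitrary, and a `ToUnicode`-style table is what lets a viewer turn them back into text for copy/paste and search.

```python
# Hypothetical subset font: codes are assigned in order of first use,
# so they have no relation to ASCII or Unicode.
TO_UNICODE = {1: "H", 2: "e", 3: "l", 4: "o"}  # invented example table


def extract_text(codes, to_unicode=None):
    """Map character codes to text the way a viewer might for copy/paste.

    With a ToUnicode-style table the text is recovered. Without one, this
    sketch falls back to an arbitrary fixed offset (an assumption standing
    in for whatever a real viewer does), which yields consistent garbage.
    """
    if to_unicode is None:
        return "".join(chr(0x40 + c) for c in codes)
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


codes = [1, 2, 3, 3, 4]  # "Hello" as drawn on the page
print(extract_text(codes, TO_UNICODE))  # Hello
print(extract_text(codes))              # ABCCD
```

Note that the garbage keeps the same length and repeats wherever the original repeats, which is at least compatible with what the screenshot shows.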

duskwuff5y ago

This is almost certainly it. I've seen similar issues with copy/paste from poorly constructed PDFs, often ones generated by "print to PDF" features.

arthur2e55y ago

Very old LaTeX PDFs tend to have this issue too. Chances are pretty slim for profs to edit PDFs with Preview, I think…

lrossi5y ago

I agree. If you look closely, you can see certain patterns repeating, they’re just not English letters. But it definitely looks like natural language, and not random binary dump.

Marioheld5y ago

Also look at the spaces. The length of the words is the same on both texts. So the content is still present just the characters got shifted.

zepto5y ago

They are using software unsupported by the vendor and blaming Apple for the outcome.

“ABBYY says they don’t support Big Sur yet, that’s fine. But Apple didn’t tell me that I can’t upgrade to Big Sur when I use ABBYY. I’d be a lot less angry if there was a changelog or release notes from Apple where it says there is a known problem with OCR’ed PDFs in Preview. Their software is broken, they need to tell me. I don’t care if it only worked because they had workarounds for super shitty PDFs that ABBYY possibly produces, I just need my OS to keep working for me.”

jcrawfordor5y ago

So Preview opens a file, which is apparently valid per Preview (Preview handles and displays it correctly). After changes, Preview then saves a file that is not valid per Preview (OCR text layer is corrupted).

It is very difficult for me to construe a situation in which this would not be considered a bug in Preview. Even if ABBYY is writing unusual PDFs, it's popular enough software that this issue will be encountered in the real world multiple times. Having to deal with unusually formed PDFs is just a general trait of writing PDF software. If you consider it a non-issue that your software writes corrupt PDFs when the same situation is handled correctly by Acrobat, no matter how unjust you feel the cruel world to be, you should not be in the PDF business. There's a reason most PDF viewers only present a highly constrained feature set, and it's because writing a capable PDF editor is very difficult. Apple has decided they are up to the challenge, and in this case has failed.

As a general rule, if your software package opens a file fine, then writes a broken version of it, that seems to all the world like a bug in your software package.

The idea that this behavior in a PDF reader can be excused because the software that generated the PDF was not approved for the operating system the viewer is running on per the vendor... kind of stretches credulity. I don't usually inquire as to the OS used to generate the PDF when I receive one.

dkonofalski5y ago

If this is a bug with Preview, then that's really, really bad since Preview is a bedrock of macOS and has been for years.

However... it sounds like the issue is that FineReader is storing the OCR'd text in the metadata in a way that's not part of the official PDF spec. So, it sounds like Preview is able to open the file by ignoring that metadata and then, upon save, is storing the metadata back, as normal, which then corrupts the OCR data. This reminds me a lot of when people would store metadata like this in MP3 files to include things like album art and booklets. Normal mp3 players would ignore it as just metadata or bogus data but opening it in an audio editor would do this same thing.

I'm not sure who the "blame" lies with in this case because ABBYY FineReader probably is writing this stuff in a non-supported way but Preview really should just ignore it rather than trying to correct it. It's very likely that the OCR text, post-save, is actually bits from the document itself rather than from the metadata.

zepto5y ago

“The idea that this behavior in a PDF reader can be excused because the software that generated the PDF was not approved for the operating system the viewer is running on per the vendor.”

Nobody is saying that. The suggestion is that the software that generates the PDF produces corrupt documents.

The fact that the vendor of that software doesn’t approve it for Big Sur suggests that they might be aware that there are problems.

arvindamirtaa5y ago

Here's the rest of the statement that you left out that answers your point.

"...I just need my OS to keep working for me. This bug could hit me without even owning a scanner at all – someone sending me a PDF that I then unknowingly break before archiving it. That’s the part I’m mad about."

zepto5y ago

It doesn’t really. Firstly, that’s not what they are mad about, since that hasn’t actually happened to them. They are mad that the PDFs their scanner produces don’t work properly.

And secondly, that hypothetical could well be using broken software too. We really don’t know how bad the PDFs ABBYY produces are, but as someone who owned this scanner and the software, it’s fairly obvious that the software is barely maintained.

It really isn’t Apple’s fault if someone else is producing bad files that they happened to previously tolerate, especially if that somebody isn’t maintaining their software.

matrixagentOP5y ago

Read it again. This bug could hit you without ever using ABBYY yourself. Apple broke Preview.

zepto5y ago

I read it. It doesn’t show that Apple broke Preview. It says Preview stopped being compatible with PDFs produced by ABBYY.

Those PDFs could have been buggy all along, and only now be showing up due to improvements in Preview.

It’s possible that Apple broke Preview, but having seen how poorly maintained ABBYY is, I wouldn’t be surprised if it was producing malformed PDFs that just happened to work on older versions of Preview.

masswerk5y ago

"Broke" may be a bit harsh. What appears to be happening is that Preview somehow loses or corrupts the `ToUnicode` map, which is apparently located in the metadata, when saving the PDF. Mind that every application will have to reassemble/reflow the metadata when saving a reflowed document (like after cropping and/or discarding pages). To do so, the application has to interpret and reassemble the metadata before writing it back.

Now, some algorithms and routines may be more robust and allowing than others. Maybe, an innocent refactoring attempt just lost that critical bit of robustness, required to deal with that particular format produced by this particular application.

For example, consider an XML-based format, where a particular application delivered a malformed document, like a missing closing quote for some attribute. Most XML interpreters will churn happily along with this, but, after a rewrite of some routine, an application just ignores the malformed tag with the runaway string. Did it break XML? Or did it just fail to interpret a malformed document, it had somehow been able to deal with thanks to some extra robustness present in the previous version?

Considering this hypothetical case: Should that application be improved by an update to regain its previous robustness? Yes, absolutely. Is it a bug and is the vendor to blame? Probably not. Mind that this might be quite well what is happening here, as well.

r00fus5y ago

So is there an active Preview corruption example that doesn't involve ABBYY? I've used FineReader before for a commercial effort, I do remember it being very finicky.

ehutch795y ago

but the file was stuff generated with abbyy, even if you give it to someone else.

AlexandrB5y ago

Would be interesting to see if Preview is stripping OCRd text from PDFs not created by ABBYY FineReader.

db48x5y ago

Are you arguing that ABBYY FineReader is going to produce different PDF files once they support the new OS? Possibly they will, if only to work around this obvious bug in Preview.

zepto5y ago

I’m arguing that yes, they will produce different PDF files.

It’s not obvious the bug is in Preview. The bug could easily be in ABBYY’s PDF generation code.

To be fair, I’m not arguing that this is the case. I’m arguing that it just as easily could be as there being a bug in Preview, which is also possible.

userbinator5y ago

I remember many years ago distributing PDFs as part of course material, that Adobe's official reader would open just fine, but Mac's built-in one wouldn't (and simply fail with a useless "an error occurred" message.) Only a small subset of the class was using Macs and the built-in reader, so it took a while to discover. The problem eventually turned out to be some oddity in the way it treats whitespace[1], that Adobe and a few other readers were perfectly fine with, but not Preview.

[1] PDF is one of the strangest file formats I've worked with. It is a bizarre mix of binary and text, and some of the other design decisions are also perplexing.

rubyn00bie5y ago

> PDF is one of the strangest file formats I've worked with.

Do you by chance have a "definitely strangest" file formats? Just curious if something out there is vastly weirder, or more perplexing, than PDFs?

agersant5y ago

I haven't worked with it myself but I heard Photoshop's PSD format is a good candidate.

maximilianburke5y ago

Yes, Adobe's PSD is definitely more weird and perplexing than PDF.

unfocused5y ago

I think the HN crowd has forgotten that the entire legal system uses PDFs, and in addition uses the redaction features of the likes of Adobe Acrobat, as well as others trying to squeeze in like FoxIT.

Redaction is huge in governments that have gone digital. Gone are the days where you print the paper, black it out, and then photocopy it.

I have worked with PDFs for a long time, and if you ever wanted compatibility, you had to use Adobe Pro, since there were so many bad PDFs with weird embedded stuff that only Adobe could read properly...because it was initially created in Adobe sigh

All other products try to catch up, but they can't clean up the mess that Adobe has left behind.

mhh__5y ago

Preview seems like a good example of something that's worth open sourcing. Not only will people end up doing work for you, you get eyes on the code and more direct issue tracking.

Consumers get a product and they still have to go on Mac to use it.

bigbubba5y ago

I've been looking for a FOSS desktop agnostic universal file previewer or thumbnail generator for a while now; if anybody has suggestions I'd love to hear them. Ffmpegthumbnailer for video thumbnails or imagemagick for image thumbnails are fine, but what about previews for things like ebooks or PDFs? Something that provides a one-stop-shop for as many common filetypes as possible is what I'm looking for.

My current solution is controlling a floating mpv window to open image, video or audio files as they are selected. This works well for A/V but not so well with other sorts of documents.

hydrox245y ago

> but what about previews for things like ebooks or PDFs

MuPDF is a great FOSS application and my go-to PDF reader. It lacks fancy annotation, and doesn't even have great text selection and copy/paste, but it is really fast, and has fast search, manipulation, etc.

https://mupdf.com/

duskwuff5y ago

For what it's worth, Preview is a relatively thin shell around Apple's own PDFKit:

https://developer.apple.com/documentation/pdfkit

Whether that could itself be open-sourced is an interesting question. (My concern would be that parts of it might be covered by Adobe NDAs.)

tonyedgecombe5y ago

My understanding is that it is Apple’s own code, they didn’t license it from Adobe.

arvindamirtaa5y ago

>Consumers get a product and they still have to go on Mac to use it.

There will be ports to windows and linux in under a month.

mhh__5y ago

How? I assume it's mostly using proprietary MacOS APIs, and, ignoring that it doesn't really matter beyond apple being an abusive partner, lawyers are a powerful tool (This software is under the Apple don't take the piss licence)

djxfade5y ago

I wouldn't be too sure. Apple's applications usually rely heavily on proprietary Cocoa APIs not available anywhere else.

yakubin5y ago

There already is an equivalent feature in GNOME. Preview's code is of little value outside of Mac. It's also not rocket science, that the thing stopping other people from doing the same thing would be lack of access to original source code. :)

fastball5y ago

From what I can tell, there is no reason you can't just run the PDF through ABBYY FineReader again and get the exact same OCR you got the first time, so I think "irreversible" is a bit over-the-top.

Is it as easy as CMD+Z? No. Is it data you can never get back? Also no.

matrixagentOP5y ago

In theory that is probably true – in my actual scenario I can't run them through ABBYY again because of the limitations of the bundled version. It only accepts PDFs coming from the scanner software, so running these through ABBYY again would give me an error message. I'd have to buy the full version to be able to try out that workaround.

non-nil5y ago

On a totally not entirely unrelated note, I have found ExifTool[0] to be quite useful for many tasks. Especially in combination with a bash alias or simple Automator action, to be used in the services menu, or as a droplet or folder action.

[0] https://exiftool.org/TagNames/PDF.html

cprecioso5y ago

This happened to me in Catalina as well. This summer I was preparing the paper proceedings for a conference, which were made with InDesign. I had to remove a couple of pages from the output, did so with Preview, and from then on, the text was garbled on copy-pasting. Had to switch to using Acrobat for that step.

juskrey5y ago

Preview for PDF manipulation was a nice try at first, until I realized I suddenly have unexpected problems with produced docs, trouble with drag-and-drop, overwritten files etc..

Now I am using PDFGenius and never looking back.

e405y ago

Let's be real. Every single macOS release, until it reaches x.y.4 or x.y.5 is just in beta and you are the tester.

I upgraded to Catalina when it hit 10.15.6, and I watched for the year since the release all the comments and posts about the horrible things it was doing to their computer, files, apps, etc.

Apple supports the latest 2 versions of macOS. Always be on the "previous" one is my advice. Since my family and friends started following it, they are much happier and more productive.

Let the masses beta test.

fastball5y ago

Is that not like, every piece of software ever?

I don't know very many pieces of widely used / actively developed software that stayed static on X.0.0 for more than a couple weeks after release or so.

e405y ago

No, it's not. I knew I'd get downvotes. Don't mind. I don't say this about macOS lightly. I've been using it since 10.0.

krull105y ago

Seconded for macOS. I usually update when the next major release is about to be announced. By 11 months they usually finally have the bugs worked out. It isn’t always necessary for every yearly release, but once you’ve been burned a few times you learn it’s better to wait for several point releases...

ehutch795y ago

Apple has a lot of shit they need to fix in macOS and the accompanying apps.

That said, the author of this article is clearly an ass, and I have a hard time being sympathetic.

Assuming the pdf is actually in spec, which it's probably not, this shouldn't be happening. That said, if the 3rd party app vendor says the pdfs they generate are broken in big sur, that should tell you, they may be broken other places as well, and it's probably not apple's issue.

matrixagentOP5y ago

Could you explain why or how exactly I'm an ass?

ehutch795y ago

To quote:

"""But Apple didn’t tell me that I can’t upgrade to Big Sur when I use ABBYY"""

cosmotic5y ago

The text corruption doesn't appear to be random. The same word gets converted to the same corruption. It's more likely an encoding/decoding bug.

dev_tty015y ago

Preview used to be solid, but it has been increasingly fragile in recent years. I found PDF Expert to be a great replacement. I have no affiliation.

nerpderp825y ago

> You have to completely close the file and reopen it, only then will you realize that it has been destroyed.

Someone5y ago

At first glance, it’s a replacement cypher. Every ‘a’ becomes a filled square, every ‘b’ a ‘p’, every ‘c’ a ‘(‘, every ‘d’ a ‘)’, etc.

However, there are exceptions, for example the first ‘b’ on line 10. It becomes an ‘ä’ on line 21. I guess that’s because that is bold text, and thus a different font.
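A quick way to test the "replacement cipher" reading is to check whether a garbled run is a consistent one-to-one substitution of the original. A minimal Python sketch follows; the sample strings are made up from the pairs listed above rather than taken from the actual file, and the function name is hypothetical.

```python
def infer_substitution(original, garbled):
    """Return a char -> char mapping if `garbled` is a consistent
    one-to-one substitution of `original`, otherwise None."""
    if len(original) != len(garbled):
        return None
    forward, backward = {}, {}
    for o, g in zip(original, garbled):
        # setdefault records the first pairing and returns it afterwards,
        # so any later disagreement shows up as a mismatch.
        if forward.setdefault(o, g) != g or backward.setdefault(g, o) != o:
            return None
    return forward


# Single font: every letter always maps to the same replacement.
print(infer_substitution("abc dab", "■p( )■p"))
# Mixed fonts in one run: the second 'b' (bold) maps differently.
print(infer_substitution("ab b", "■p ä"))  # None
```

The second call returning `None` matches the bold-text exception: the check would have to be run per font, since each embedded font would carry its own table.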

rubatuga5y ago

Once again, the Hacker News comments prove to be more useful and insightful than the article itself.

kekeblom5y ago

I had an issue recently where the form contents filled and saved with Preview.app would not show up in acrobat reader. I've encountered this in two cases so far, with two completely different documents.

qwerty4561275y ago

I have encountered too many PDFs (mostly digital originals rather than OCRed scans) corrupted this way during the recent months. Now I see why...

skissane5y ago

I hate Preview's PDF editing features, I wish there was a way to turn them off.

I'm the kind of person who tends to randomly click on things as I read them. In other PDF readers, this is quite harmless. In Preview, it starts editing the PDF. 99.9% of the time I have zero interest in editing or annotating the PDF I am reading. And then when I quit it asks me if I want to save a copy. I never wanted to change it to begin with!

(Maybe it is time I found another PDF reader...)

jordache5y ago

anyone else not able to see sufficient details in the tiny screenshots? What was the difference?

sp3325y ago

The difference to look for is between the top half on the right vs the bottom half on the right. The text has been scrambled into random symbols.

Here's a direct link to the 2,240x939 image: https://annoying.technology/media/previeweatingpdfs.png

r00fus5y ago

There is a more detailed image link in the doc.

lisper5y ago

Using Apple devices in general seems like a total crap shoot to me nowadays because of the impossibility of down-grading the OS. Every "upgrade" comes with a considerable risk that something that had been working will stop working, and if that happens, you are pretty much SOL.

fastball5y ago

What? You can definitely downgrade to an earlier MacOS.

It's not a one-click downgrade like the upgrade is, but I don't know of any OS with that feature.

lisper5y ago

> You can definitely downgrade to an earlier MacOS.

Sometimes you can, sometimes you can't. Going from Mavericks to Yosemite for example is one-way because it includes a non-backwards-compatible firmware update. Going to Catalina is also one-way because it changes the file system from HFS+ to APFS.

And iOS is famously non-downgradable.

rbanffy5y ago

> What? You can definitely downgrade to an earlier MacOS.

Unless they got a brand-new M1-based Mac. Macs usually don't install versions of macOS prior to their launches.

00000111115y ago

Use "Adobe Acrobat Reader DC" for pdf work on macOS v11.1

tonyedgecombe5y ago

I tried that and it was less reliable than Preview.

ProAm5y ago

In what ways?

nt2h9uh238h5y ago

Is this German?

matrixagentOP5y ago

Yes.

anonuser1234565y ago

Time machine?

dewey5y ago

Backups are always great, but if something is broken silently behind your back and you only realize in a few years that your archived documents are not searchable any more that makes it harder to recover.
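One way to catch this kind of silent breakage earlier is to sanity-check the text layer after every save, before the file goes into the archive. A minimal Python sketch, assuming the text has already been extracted by some other tool (extraction is out of scope here); the function name and the 0.8 threshold are hypothetical and crude:

```python
def looks_like_text(extracted: str, min_ratio: float = 0.8) -> bool:
    """Heuristic: most non-space characters should be letters or digits."""
    chars = [c for c in extracted if not c.isspace()]
    if not chars:
        return False  # an empty text layer is suspicious too
    good = sum(c.isalnum() for c in chars)
    return good / len(chars) >= min_ratio


print(looks_like_text("Sehr geehrte Damen und Herren"))  # True
print(looks_like_text("■p( )■p"))                        # False
```

A substitution that happens to land mostly on letters would slip through, so this is only a first filter, not a guarantee.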

beamatronic5y ago

Preview should not change the file on disk. I would expect it to open the original file as read-only.

blacksmith_tb5y ago

Yes, the author says it's "the result after modifying (removed a blank page) and saving that same PDF in Preview." So it's not enough to just view the file in Preview.app I take it, but you need to save it out (which still shouldn't strip anything extra, obviously, but is not what I thought was being claimed).

birdyrooster5y ago

So you are saying that they overwrote their file and are upset that the file they overwrote is different from the new file? This is insanity. Clearly a bug in ABBYY that it can’t read PDF saved in the standard spec.

PDF is not a bitmap, it’s a script like HTML or JS. People understand browser incompatibility, but somehow this is unconscionable.

throwaway7446785y ago

I understand it does not: the issue occurs when the user removes another (blank) page, then saves the file.

MrBuddyCasino5y ago

> In the lower half is the result after modifying (removed a blank page) and saving that same PDF in Preview.

I don't think this means Preview changes the files just by opening them.

YetAnotherNick5y ago

PDFs are not intended to be modified. Preview and other readers use hacks to do the work. In general don't modify the PDF and if you really want to do it buy Acrobat reader.

tonyedgecombe5y ago

The PDF file format has a mechanism in it for modifying documents.
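Presumably this refers to incremental updates: a writer may append changed objects plus a new cross-reference section and trailer to the end of the file instead of rewriting it, so each saved revision ends with its own `%%EOF` marker. A rough Python sketch that counts those markers; it is only a heuristic (the same bytes could occur inside stream data), and the sample bytes are a made-up skeleton, not a valid PDF:

```python
def count_eof_markers(data: bytes) -> int:
    """Count %%EOF markers; more than one hints at incremental updates."""
    return data.count(b"%%EOF")


# Made-up skeletons standing in for real files.
original = b"%PDF-1.4\n...objects...\nstartxref\n0\n%%EOF\n"
updated = original + b"...appended objects...\nstartxref\n0\n%%EOF\n"

print(count_eof_markers(original))  # 1
print(count_eof_markers(updated))   # 2
```

Whether a given editor appends like this or rewrites the whole file on save is up to the editor; the sketch only shows what an appended revision would look like from the outside.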

sn415y ago

There was something in macOS Catalina that broke mupdf on my MacBook Pro. The view would occupy the lower left corner of the window, and something was clipping the view to the lower quadrant.

I tried installing from source, changing the gl library etc. But it was the same.

Am done with Apple for now. M1 is a bit tempting, but I guess I will wait for the technology to mature, buy a Macbook Air, and run Linux on it.

ehutch795y ago

Why would installing from source change things? Without finding/fixing the bug, you're just using the same compiled code as before.

sn415y ago

I was trying to avoid library incompatibilities. I pulled everything from the repository and recompiled with the latest libraries. I also tried a couple of different libraries. I gave up after a week or so. (What I did not do was compile the libraries from source as well.)

I really like mupdf so it was a big nuisance for me to lose that.

FireBeyond5y ago

Multiple people are complaining that Big Sur is blowing out their speakers on the laptops.

I submitted something yesterday where Big Sur completely breaks DSC for all non-Apple monitors (and in some cases, even those).

Oof.

p1necone5y ago

The more I learn about macs the more I think the "it just works" crowd really mean "I will sacrifice my system "working" sometimes in exchange for having zero configuration options".

iansinnott5y ago

In the past it really did just work and macs were configurable. The trend of limited configuration is more recent (and yes, it's terrible).

netflixandkill5y ago

For a decade or so it pretty much did just work. Alas, nothing gold (or metallic shades of white in this case) can stay.

norswap5y ago

Apple, who sold hardware because they had the best software, now sells software because they have the best hardware.

Funny how things change.

dkonofalski5y ago

Those things are literally not possible to happen without your intervention... ಠ_ಠ

robertoandred5y ago

Do you have proof of iCloud deleting photos?

meibo5y ago

I assume these are a result of bad UX, at least in my personal experience.

Yetanfou5y ago

[1] where 'Linux' stands for any supported Linux distribution

systemvoltage5y ago

PDF is one of the best things IMO - it's like a docker container for documents. The way the original authors intended it to be, including fonts and all the things that go into making a document.

wtallis5y ago

> PDF is one of the best things IMO - it's like a docker container for documents. The way the original authors intended it to be, including fonts and all the things that go into making a document.

Sounds like you want the PDF/A standard for archiving: https://en.wikipedia.org/wiki/PDF/A

inetknght5y ago

> I want to rely on professional typesetting from the publishers, not some front-end engineer's vision of what the ebook should look like and how it should be presented to the user.

markandrewj5y ago

Generally the PDF copy of a book is designed to fit the format of the final printed book. The PDF is even often the deliverable sent to press.

As others have mentioned there are also a number of security risks associated with PDF.

I can't deny there are several books I have read in PDF format however.

odyssey75y ago

Knuth as a book author has written about this on his website.

https://www-cs-faculty.stanford.edu/~knuth/taocp.html

ernst_klim5y ago

> I personally like to purchase books that are in PDF format, not epub/mobi

PDF is abysmal for books though. 1) You can't scale fonts to fit various screen sizes. 2) It's waaay more expensive to interpret and render, so it affects battery life.

jcelerier5y ago

> The way the original authors intended it to be, including fonts and all the things that go into making a document.

cries in random PDFs that end up being printed with letters spaced by 1 cm

adrianmonk5y ago

> absolutely DO NOT WANT - is web page like format with auto-flowing text

I would love it, though, if PDF included this as something that's always entirely optional.

Sometimes I want to read something just as formatted. If my display is big enough, and the formatting has any importance, I probably want that.

Other times, my screen is smaller (phone) or the wrong shape (small laptop), and I'd rather the text conform to the device rather than vice versa.

higerordermap5y ago

> The way the original authors intended it to be, including fonts and all the things that go into making a document.

So that it's very very hard to read on mobile, or in a small width window.

> For example, I personally like to purchase books that are in PDF format, not epub/mobi

Exactly the opposite. I want to be able to properly reflow it in a narrow window, side by side with another tab on a laptop, often for technical books.

ketamine__5y ago

What device do you use to read PDF's?

pklausler5y ago

Would you be happy with some kind of paginated image file format, if it were not significantly larger than PDF?

fomine35y ago

Conversely, what I absolutely WANT is a web-page-like format with auto-flowing text, something that fits the screen with user styling/typesetting. I wish both options were available.

hcurtiss5y ago

I agree completely. Have you found any good resources for acquiring books in pdf?

izacus5y ago

> Why anyone treats PDF as anything but a write-once format is beyond me.

sigh These kinds of ignorant statements really annoy me.

- Edited, annotated and commented

- Approved with explicit markers

- Digitally signed by auditors and reviewers with tamper protection

- Rendered on anything that has a semblance of interactive UI with graceful degradation

- Archived on any kind of file storage

is insanely useful and valuable for massive worldwide industries.

TwoBit5y ago

And yet PDF re-writing keeps getting screwed up. Probably because it really is a bad format for this, with the pages of docs notwithstanding.

theamk5y ago

Sure, but the major text is still "write once" -- you can only add specially designated information to a read-only core.

As this article shows, a trivial operation on the core text, like "remove a blank page", is still very hard and pretty easy to get wrong.

Igelau5y ago

le sigh the surface changes you're talking about are small potatoes compared to removing a page and altering document structure without breaking anything.

bigbubba5y ago

gfody5y ago

you can also embed images and fonts in svg, probably not as reliable as pdf for pixel-perfect reproductions everywhere though

monocasa5y ago

Would love something like that with svg allowed inside to support vector drawings.

ASalazarMX5y ago

And this has become the de facto open standard, supported by many document viewers, probably because it's a very straightforward combination of two widely used tools.

yarcob5y ago

Basic PDF editing (adding, removing, reordering, cropping and rotating) has been rock solid in Apple's Preview app. It's something that I dearly miss when I'm on Windows.

Which makes showstopper bugs like this very unfortunate.

But then again, it's been common knowledge that you shouldn't upgrade to a .0 release...

helmholtz5y ago

I've tried all of the commands you've mentioned, and so far it's worked without a hitch.

crazygringo5y ago

Because you often have no choice.

You have a PDF of a book and you need to export the pages of a single chapter to a new PDF.

Or you have 30 different PDF's that you need to combine into a single one.

Or a full-color large-filesize scanned PDF that you need to convert to a smaller-sized black-and-white one.

Or you need to copy quotes from a PDF to paste somewhere else.

I could go on... PDF's are documents, and normal workflows involve all sorts of conversion and rearrangement and processing of documents. That's the whole point.

_-david-_5y ago

None of those require modifying the existing document. A new document could be produced and the existing document(s) would only be read from.

richard_todd5y ago

Yetanfou5y ago

[0] https://www.pdflabs.com/tools/pdftk-server/

[1] https://en.wikipedia.org/wiki/Comic_book_archive

[2] https://en.wikipedia.org/wiki/DjVu

sildur5y ago

Blind and shortsighted people may not like your idea. There are accessibility options in PDF, but even a regular one can be accessed with a screen reader. Your images won't.

bigbubba5y ago

theamk5y ago

It's not like PDF is the only one with text layer, or the only options are PDF, JPG or PNG.

DjVu can have text layers as well, and I think SVG too?

qwerty4561275y ago

> Why anyone treats PDF as anything but a write-once format is beyond me.

Because PDF is supposed to be like electronic paper and you normally can draw on a paper original.

eddieh5y ago

We have that container format. It is called: TIFF

ruined5y ago

If you don't want to break something, just don't try to edit it. That can't be enforced anywhere but at the individual user level.

Even proprietary formats in first-party software have this problem. Hell, plain-text editors have this problem.

pm2155y ago

I've never wanted to edit a PDF, so personally I'd rather the feature was gated behind a menu option or something so you had to deliberately ask for it.

jrochkind15y ago

Making a PDF include OCR'd text seems to by definition require writing twice. It has to be edited to add the OCR'd text that was not in the original image-only PDF, no?

aardvark1795y ago

tonyedgecombe5y ago

PDF does support appending to the file and it’s not just for adding pages. You can make changes throughout the original document.

I’m not sure that is what Preview does though.
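For anyone who hasn't looked at how appending works, here is a rough sketch of the idea, not a real PDF writer. An incremental update leaves the original bytes alone and tacks changed objects plus a fresh `startxref`/`%%EOF` pair onto the end; a reader starts from the last one. The byte strings below are a made-up skeleton (object bodies elided, offsets invented), so this only illustrates the layout.

```python
import re

def xref_offsets(pdf_bytes):
    """Return every offset announced by a 'startxref' keyword, oldest first."""
    pattern = rb"startxref\s+(\d+)\s+%%EOF"
    return [int(m.group(1)) for m in re.finditer(pattern, pdf_bytes)]

# Skeleton only: object bodies and xref tables are elided, so this is not a valid PDF.
original = b"%PDF-1.4\n...objects...\nstartxref\n120\n%%EOF\n"
# Hypothetical update section appended by a later save.
update = b"...changed objects...\nstartxref\n480\n%%EOF\n"
revised = original + update

offsets = xref_offsets(revised)
print(len(offsets))                  # one entry per saved revision
print(offsets[-1])                   # a reader would start from the newest one
print(revised.startswith(original))  # the first revision is still there, byte for byte
```

Counting `%%EOF` markers in a file is a quick, if crude, way to guess whether a tool appended to it or rewrote it from scratch.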

idle_zealot5y ago

> I think we need a multi-image container format. It could be something that's literally a bunch of jpgs/pngs/pick your poison in a tar container

You've just described the .cbt format!

https://en.m.wikipedia.org/wiki/Comic_book_archive
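A minimal sketch of the "bunch of images in a tar" idea, using only Python's standard `tarfile` module. The file names and payloads are placeholders I made up, and sorting by file name is my assumption about how a reader would order the pages.

```python
import io
import tarfile

def build_cbt(pages):
    """Pack (filename, bytes) pairs into an uncompressed tar archive held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in pages:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def list_pages(archive_bytes):
    """Return member names sorted by file name (assumed page order)."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return sorted(member.name for member in tar.getmembers())

# Placeholder payloads stand in for real JPEG/PNG data.
pages = [("002.png", b"second page"), ("001.png", b"first page")]
archive = build_cbt(pages)
print(list_pages(archive))
```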

marcan_425y ago

.zip is better, at least that has an index, but at the end of the file, so it's not linearly streamable.

For what it's worth, .pdf is structurally ~the same as .zip; the index is at the end, and blocks are compressed individually.

Choice of archive format matters :-)
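You can see the "index at the end" part with a few lines of Python. This is just an illustration with a tiny in-memory archive and made-up member names: the member data starts at offset 0, while the end-of-central-directory record (signature `PK\x05\x06`) is the last thing in the file.

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("001.png", b"first page")    # placeholder payloads
    zf.writestr("002.png", b"second page")
data = buf.getvalue()

# Local file headers (PK\x03\x04) come first, one in front of each member.
first_local = data.find(b"PK\x03\x04")
# The end-of-central-directory record is found by searching from the end.
eocd_offset = data.rfind(b"PK\x05\x06")

print(first_local)
# With no archive comment the record is 22 bytes long and closes the file.
print(len(data) - eocd_offset)
```

A reader that only has a forward stream cannot know what is in the archive until it reaches those final bytes, which is the streaming problem mentioned above.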

nine_k5y ago

PDF is often called "electronic paper", meaning its primary use, typesetting documents on screen and paper alike.

I'd rather view PDF as an electronic clay tablet. You can put any text or image on it, but once it's formed, you'd better not try to alter it, lest you break it.

m4635y ago

What annoys me is that there are some .pdf manuals I want to refer to, and when I view them in preview and go to quit it asks if I want to keep my changes or revert them.

WTF? why does viewing a manual - and I only used navigation like page up / page down - cause modifications?

soraminazuki5y ago

You've likely created a text box by double clicking on the document.

ernst_klim5y ago

> Why anyone treats PDF as anything but a write-once format is beyond me.

Hahaha. Quite a lot of people around me think that PDF is a collection of JPEGs glued together (because they mostly see PDF as scanned non-OCRed docs).

thordenmark5y ago

"Why anyone treats PDF as anything but a write-once format is beyond me."

Some of us work in print publishing and being able to edit PDFs is critical.

SoSoRoCoCo5y ago

Here is one good reason: My company uses digital certificate signing on PDFs. It is a very common usage. We have yet to migrate to DocuSign.

cmiller15y ago

> The only programs I'd be reasonable sure wouldn't screw it up are Acrobat itself

Oh no, I've seen acrobat screw with it too

db48x5y ago

There are already a handful of multi-image container formats. They're all images inside zipfiles, of course.

Siira5y ago

We already have CBR/CBZ.

christkv5y ago

CBR and CBZ for comics fits this I think?

CamperBob25y ago

.PDF works just fine. Don't blame the shortcomings of crappy software implementations on the format itself.

We need a format that will still be readable 100 years or more from now, and .PDF serves that purpose.

crazygringo5y ago

I work with a ton of PDF's between my Mac and iPad, and it mostly works but there are still just way too many bugs.

It's a lot of little things, like in Catalina where opening up the sidebar for annotations (comments) seemingly randomized their order. (Big Sur, fortunately, fixed it to be page-order again.)

Or whenever you open the PDF in Books it remembers which page you were on. Except sometimes it doesn't, so you can't really rely on that for saving your place.

Or in Books, if you select some text to copy but accidentally hit the adjacent "select all" in the pop-up menu, and you're dealing with a 400-page PDF, it just locks up and you have to restart it.

Or in Preview if you want to convert a PDF to black-and-white, there's an option for it but your PDF will balloon in filesize to 10x larger or something.

I mean, I could go on and on. It's weird, because Preview is an incredible app, really. But it really is like they build it and then never bother to test if basic workflows reliably work.

inetknght5y ago

> 10% of the glyphs were scrambled ("lik3 thZs"), like some sort of character table corruption.

As I understand it, that's a form of copy protection trying to prevent exactly what you're doing.

crazygringo5y ago

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=112849...

Someone5y ago

More likely it’s the effect of font subsetting.

You don’t want to embed entire fonts, though, certainly not if a PDF uses only a few characters of a font.

The cheapest way, memory-wise, of embedding a font leaves out the table that maps glyph numbers back to Unicode code points. If you do that, the PDF reader guesses a 1:1 mapping to ASCII/Unicode.

Subsetting means you can’t extract all glyphs in a font from a single file, but AFAIK, that’s just a side effect.

I guess this bug drops such a table, or messes up that translation table.
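A toy model of what I mean, not how any real PDF library works. The "subsetter" here is invented for illustration: it hands out codes in order of first use and keeps a code-to-character table. Extraction with the table gives the text back; extraction without it falls back to treating each code as a character, which is the garbage you see.

```python
def subset_font(text, first_code=0x21):
    """Toy subsetter: give each distinct character the next free code."""
    code_for = {}
    for ch in text:
        if ch not in code_for:
            code_for[ch] = first_code + len(code_for)
    codes = [code_for[ch] for ch in text]
    # Stand-in for the glyph-to-Unicode table that a writer may omit.
    to_unicode = {code: ch for ch, code in code_for.items()}
    return codes, to_unicode

def extract(codes, to_unicode=None):
    """Recover text; without a table, guess that code == character code."""
    if to_unicode is None:
        return "".join(chr(code) for code in codes)
    return "".join(to_unicode[code] for code in codes)

codes, table = subset_font("hello world")
print(extract(codes, table))  # table present: the original text comes back
print(extract(codes))         # table missing: same length, wrong characters
```

Note that the page would still render fine in both cases, because drawing only needs the glyph codes. Only copy/paste and search depend on the table.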

m4635y ago

Try to screenshot with a movie onscreen.

You get a grey rectangle.

drevil-v25y ago

That not remembering the last page you were at in Books drives me crazy.

Jtsummers5y ago

I've been a happy GoodReader user for years. Syncs with a variety of sources, remembers where I was, bookmarks work. Annotations work, but I don't use that much beyond being able to say it works.

avalys5y ago

This is a clickbait, sensationalist headline. “Saving a PDF with Preview in Big Sur can corrupt OCR text added by a third-party program” is more accurate.

matrixagentOP5y ago

jabbany5y ago

Kind of agree the current title is a bit misleading.

As for non-Acrobat software mangling PDFs after editing... Well that's much less surprising. I've even had Acrobat mangle stuff in PDFs after editing...

wil4215y ago

matrixagentOP5y ago

refulgentis5y ago

In my very humble opinion, it's accurate: "Saving a PDF with Preview in Big Sur [Preview in Big Sur] can corrupt OCR text added by a third-party program [is irreversibly destroying PDFs]"

birdyrooster5y ago

PDF is not a bitmap, it’s a script like HTML or JS.

People understand browser incompatibility but somehow this is unconscionable.

dkonofalski5y ago

Maybe this can give you some insight. I downvoted your comment simply because it didn't add anything to the discussion and you made an assumption that has several faults in it.

vzqx5y ago

The title is technically accurate, but it's misleading to non-mac users like myself. I assumed the author was using functionality called "Preview" only to view the documents, rather than save them.

There's a big difference between "read-only operation is mangling files" vs. "PDF writer is buggy".

JxLS-cpgbe05y ago

> Not sure how much longer I can keep trying

Keep trying, just with a new account every few months or so. HN has no privacy controls, we must add our own.

birdyrooster5y ago

caminocorner5y ago

If the behavior changes, that's not on me, that's on Preview.

I don't have any issue with this today on my Mac, but I'm glad I didn't upgrade to Big Sur. I almost did.

lilyball5y ago

Yes, they used Preview to modify the PDF.

CapriciousCptl5y ago

lilyball5y ago

If you're not going to do the work to figure out what the corruption is, at least include the two PDFs so other people can look at them and see what happened.

dewey5y ago

lilyball5y ago

A radar from 2016 is not useful, that describes an old bug. Just because the symptoms look like something we’ve seen before doesn’t mean it’s the same underlying issue.

matrixagentOP5y ago

> If you're not going to do the work to figure out what the corruption is …

JumpCrisscross5y ago

> neither Apple nor ABBYY pay my salary

This is a fair bar for conversation, in person or online. One can be more demanding of a public write-up.

ztravis5y ago

duskwuff5y ago

This is almost certainly it. I've seen similar issues with copy/paste from poorly constructed PDFs, often ones generated by "print to PDF" features.

arthur2e55y ago

Very old LaTeX PDFs tend to have this issue too. Chances are pretty slim for profs to edit PDFs with Preview, I think…

lrossi5y ago

I agree. If you look closely, you can see certain patterns repeating, they’re just not English letters. But it definitely looks like natural language, and not random binary dump.

Marioheld5y ago

Also look at the spaces. The length of the words is the same on both texts. So the content is still present just the characters got shifted.
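If you have both versions of the text, this is easy to check. A quick sketch, with a scrambled string I invented as a stand-in for the screenshot:

```python
def word_lengths(text):
    """Length of each whitespace-separated word, in order."""
    return [len(word) for word in text.split()]

original = "Preview saved the file"
scrambled = "Xzqbnqe vpbqm gtq rndq"  # invented stand-in, not from the article

print(word_lengths(original))
print(word_lengths(original) == word_lengths(scrambled))
```

Matching length profiles are what you would expect if characters were swapped one for one rather than the content being dropped.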

zepto5y ago

They are using software unsupported by the vendor and blaming Apple for the outcome.

jcrawfordor5y ago

As a general rule, if your software package opens a file fine, then writes a broken version of it, it seems to all the world like a bug in your software package.

dkonofalski5y ago

If this is a bug with Preview, then that's really, really bad since Preview is a bedrock of macOS and has been for years.

zepto5y ago

“The idea that this behavior in a PDF reader can be excused because the software that generated the PDF was not approved for the operating system the viewer is running on per the vendor.”

Nobody is saying that. The suggestion is that the software that generates the PDF produces corrupt documents.

The fact that the vendor of that software doesn’t approve it for Big Sur suggests that they might be aware that there are problems.

arvindamirtaa5y ago

Here's the rest of the statement that you left out that answers your point.

zepto5y ago

It doesn’t really. Firstly, that’s not what they are mad about, since that hasn’t actually happened to them. They are mad that the pdf’s their scanner produces don’t work properly.

It really isn’t Apple’s fault if someone else is producing bad files that they happened to previously tolerate, especially if that somebody isn’t maintaining their software.

matrixagentOP5y ago

Read it again. This bug could hit you without ever using ABBYY yourself. Apple broke Preview.

zepto5y ago

I read it. It doesn’t show that Apple broke Preview. It says Preview stopped being compatible with PDF’s produced by ABBYY.

Those PDFs could have been buggy all along, and only now be showing up due to improvements in preview.

masswerk5y ago

r00fus5y ago

So is there an active Preview corruption example that doesn't involve ABBYY? I've used FineReader before for a commercial effort, I do remember it being very finicky.

ehutch795y ago

But the file was still generated with ABBYY, even if you give it to someone else.

AlexandrB5y ago

Would be interesting to see if Preview is stripping OCRd text from PDFs not created by ABBYY FineReader.

db48x5y ago

Are you arguing that ABBYY Finereader is going to produce different PDF files once they support the new OS? Possibly they will, if only to work around this obvious bug in Preview.

zepto5y ago

I’m arguing that yes, they will produce different PDF files.

It’s not obvious the bug is in Preview. The bug could easily be in ABBYY’s PDF generation code.

To be fair, I’m not arguing that this is the case. I’m arguing that it just as easily could be as there being a bug in Preview, which is also possible.

userbinator5y ago

[1] PDF is one of the strangest file formats I've worked with. It is a bizarre mix of binary and text, and some of the other design decisions are also perplexing.
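To show what I mean by the mix, here is a hand-built stream object, a sketch rather than a complete file. The dictionary and keywords are plain ASCII, while the payload between `stream` and `endstream` is raw deflate output. The object number and the content string are arbitrary choices for the example.

```python
import re
import zlib

# Page content operators are text, but they are usually stored compressed.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(content)

obj = (
    b"5 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)

# The text part tells a parser how many binary bytes follow.
header, rest = obj.split(b"stream\n", 1)
length = int(re.search(rb"/Length (\d+)", header).group(1))
payload = rest[:length]

print(header.decode("ascii"))
print(zlib.decompress(payload) == content)
```

A parser has to switch between tokenizing text and copying exactly `/Length` raw bytes, which is part of why naive tools that treat the file as text break it.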

rubyn00bie5y ago

> PDF is one of the strangest file formats I've worked with.

Do you by chance have a "definitely strangest" file formats? Just curious if something out there is vastly weirder, or more perplexing, than PDFs?

agersant5y ago

I haven't worked with it myself but I heard Photoshop's PSD format is a good candidate.

maximilianburke5y ago

Yes, Adobe's PSD is definitely more weird and perplexing than PDF.

unfocused5y ago

Redaction is huge in governments that have gone digital. Gone are the days where you print the paper, black it out, and then photocopy it.

All other products try to catch up, but they can't clean up the mess that Adobe has left behind.

mhh__5y ago

Preview seems like a good example of something that's worth open sourcing. Not only will people end up doing work for you, you get eyes on the code and more direct issue tracking.

Consumers get a product and they still have to go on Mac to use it.

bigbubba5y ago

My current solution is controlling a floating mpv window to open image, video or audio files as they are selected. This works well for A/V but not so well with other sorts of documents.

hydrox245y ago

> but what about previews for things like ebooks or PDFs

https://mupdf.com/

duskwuff5y ago

For what it's worth, Preview is a relatively thin shell around Apple's own PDFKit:

https://developer.apple.com/documentation/pdfkit

Whether that could itself be open-sourced is an interesting question. (My concern would be that parts of it might be covered by Adobe NDAs.)

tonyedgecombe5y ago

My understanding is that it is Apple’s own code, they didn’t license it from Adobe.

arvindamirtaa5y ago

>Consumers get a product and they still have to go on Mac to use it.

There will be ports to windows and linux in under a month.

mhh__5y ago

djxfade5y ago

I wouldn't be too sure, Apple's applications usually rely heavily on proprietary Cocoa APIs not available anywhere else.

yakubin5y ago

fastball5y ago

From what I can tell, there is no reason you can't just run the PDF through ABBYY FineReader again and get the exact same OCR you got the first time, so I think "irreversible" is a bit over-the-top.

Is it as easy as CMD+Z? No. Is it data you can never get back? Also no.

matrixagentOP5y ago

non-nil5y ago

cprecioso5y ago

juskrey5y ago

Preview for PDF manipulation was a nice try at first, until I realized I suddenly have unexpected problems with produced docs, trouble with drag-and-drop, overwritten files etc..

Now I am using PDFGenius and never looking back.

e405y ago

Let's be real. Every single macOS release, until it reaches x.y.4 or x.y.5 is just in beta and you are the tester.

I upgraded to Catalina when it hit 10.15.6, and I watched for the year since the release all the comments and posts about the horrible things it was doing to their computer, files, apps, etc.

Apple supports the latest 2 versions of macOS. Always be on the "previous" one is my advice. Since my family and friends started following it, they are much happier and more productive.

Let the masses beta test.

fastball5y ago

Is that not like, every piece of software ever?

I don't know very many pieces of widely used / actively developed software that stayed static on X.0.0 for more than a couple weeks after release or so.

e405y ago

No, it's not. I knew I'd get downvotes. Don't mind. I don't say this about macOS lightly. I've been using it since 10.0.

krull105y ago

ehutch795y ago

Apple has a lot of shit they need to fix in macOS and the accompanying apps.

That said, the author of this article is clearly an ass, and i have a hard time being sympathetic.

matrixagentOP5y ago

Could you explain why or how exactly I'm an ass?

ehutch795y ago

To quote:

"""But Apple didn’t tell me that I can’t upgrade to Big Sur when I use ABBYY"""

cosmotic5y ago

The text corruption doesn't appear to be random. The same word gets converted to the same corruption. It's more likely an encoding/decoding bug.

dev_tty015y ago

Preview used to be solid, but it has been increasingly fragile in recent years. I found PDF Expert to be a great replacement. I have no affiliation.

nerpderp825y ago

> You have to completely close the file and reopen it, only then will you realize that it has been destroyed.

Someone5y ago

At first glance, it’s a replacement cypher. Every ‘a’ becomes a filled square, every ‘b’ a ‘p’, every ‘c’ a ‘(‘, every ‘d’ a ‘)’, etc.

However, there are exceptions, for example the first ‘b’ on line 10. It becomes an ‘ä’ on line 21. I guess that’s because that is bold text, and thus a different font.
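If someone posts the two text layers, a few lines of Python would confirm this. A sketch under assumptions: the sample strings are invented to mirror the pairs above (a to a filled square, b to p, c to "(", d to ")"), with one "ä" thrown in to play the bold-font exception.

```python
def infer_substitution(original, corrupted):
    """Build a char-to-char table and record positions that contradict it."""
    mapping = {}
    conflicts = []
    for pos, (src, dst) in enumerate(zip(original, corrupted)):
        seen = mapping.setdefault(src, dst)
        if seen != dst:
            conflicts.append((pos, src, seen, dst))
    return mapping, conflicts

# Invented sample; the last 'b' stands in for text set in a different font.
original = "bad cab"
corrupted = "p■) (■ä"

mapping, conflicts = infer_substitution(original, corrupted)
print(mapping)
print(conflicts)
```

An empty or very short conflict list points at a consistent per-font table mix-up; a long one would suggest something closer to real data loss.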

rubatuga5y ago

Once again, the Hacker News comments prove to be more useful and insightful than the article itself.

kekeblom5y ago

qwerty4561275y ago

I have encountered too many PDFs (mostly digital originals rather than OCRed scans) corrupted this way in recent months. Now I see why...

skissane5y ago

I hate Preview's PDF editing features, I wish there was a way to turn them off.

(Maybe it is time I found another PDF reader...)

jordache5y ago

Anyone else not able to see sufficient detail in the tiny screenshots? What was the difference?

sp3325y ago

The difference to look for is between the top half on the right vs the bottom half on the right. The text has been scrambled into random symbols.

Here's a direct link to the 2,240x939 image: https://annoying.technology/media/previeweatingpdfs.png

r00fus5y ago

There is a more detailed image link in the doc.

lisper5y ago

fastball5y ago

What? You can definitely downgrade to an earlier MacOS.

It's not a one-click downgrade like the upgrade is, but I don't know of any OS with that feature.

lisper5y ago

> You can definitely downgrade to an earlier MacOS.

And iOS is famously non-downgradable.

rbanffy5y ago

> What? You can definitely downgrade to an earlier MacOS.

Unless they got a brand-new M1-based Mac. Macs usually don't install versions of macOS prior to their launches.

00000111115y ago

Use "Adobe Acrobat Reader DC" for pdf work on macOS v11.1

tonyedgecombe5y ago

I tried that and it was less reliable than Preview.

ProAm5y ago

In what ways?

nt2h9uh238h5y ago

Is this German?

matrixagentOP5y ago

Yes.

anonuser1234565y ago

Time machine?

dewey5y ago

beamatronic5y ago

Preview should not change the file on disk. I would expect it to open the original file as read-only.

blacksmith_tb5y ago

birdyrooster5y ago

PDF is not a bitmap, it’s a script like HTML or JS. People understand browser incompatibility but somehow this is unconscionable.

throwaway7446785y ago

I understand it does not: the issue occurs when the user removes another (blank) page, then saves the file.

MrBuddyCasino5y ago

> In the lower half is the result after modifying (removed a blank page) and saving that same PDF in Preview.

I don't think this means Preview changes the files just by opening them.

YetAnotherNick5y ago

PDFs are not intended to be modified. Preview and other readers use hacks to do the work. In general don't modify the PDF and if you really want to do it buy Acrobat reader.

tonyedgecombe5y ago

The PDF file format has a mechanism in it for modifying documents.

sn415y ago

There was something in macos Catalina that broke mupdf on my macbook pro. The view would occupy the lower left corner of the window, and something was clipping the view to the lower quadrant.

I tried installing from source, changing the gl library etc. But it was the same.

Am done with Apple for now. M1 is a bit tempting, but I guess I will wait for the technology to mature, buy a Macbook Air, and run Linux on it.

ehutch795y ago

Why would installing from source change things? Without finding/fixing the bug, you're just using the same compiled code as before.

sn415y ago

I really like mupdf so it was a big nuisance for me to lose that.
