(But maybe it wasn't as much on people's radars, with all the lower-hanging fruit of other technology choices and practices, outside of PDF.)
New code for a large spec is also interesting for potential vulns, but maybe easier to get confidence about.
One neat direction they could go is to be considered more trustworthy than the Adobe products. For example, if one is thinking of a PDF engine as (among other purposes) supporting the use case of a PDF viewer that's an agent of the interests of that individual human user, then I suspect you're going to end up with different attention and decisions affecting security (compared to implementations from businesses focused on other goals).
(I say agent of the individual user, but that can also be aligned with enterprise security, as an alternative to risk management approaches that, e.g., ultimately will decide they're relying on gorillas not to make it through the winter.)
IMHO, having well-engineered tools handle data, and being conservative about the trust/privileges given to externally-sourced data, is at least complementary to the current "zero trust" thinking about networks and nodes.
(Example: Does your spreadsheet really need arbitrary code execution, in an imperfect sandbox, for all your nontechnical users? Should what people might think is a self-contained, standalone text document file really phone home, disclosing your activity and location, or have the potential to be remotely memory-holed/disabled, along with the attendant security risks from that added complexity and the additional requirements it puts on host systems/tools to try to enforce that questionable design?)
Zero days will always exist, it seems; even Chrome has them, with hundreds of security researchers' eyes on it.
Not hard
But anyway - I understand why they changed their interpreter; however, the lack of a major version bump threw me off. I use ps2pdf to optimize PDFs (long story short - it makes them smaller) and was alarmed when my PDFs suddenly ended up without their JPEG backgrounds. Instead, purely black (although this did result in a very small file size so who knows... :) )
Thankfully you can add `-dNEWPDF=false` to your command to use the old interpreter. I've yet to submit a bug report, but it would be nice if it were backwards compatible...
You can also reach us developers over at our Ghostscript Discord channel https://discord.gg/H9GXKwyPvY (https://discord.gg/SnXWzqzjKs for MuPDF).
Anyone who has done PDF composition for a "print ready" job (what a lie) from a client has run into this many times. All we have to do is rearrange the pages into the right order, add some barcodes, and print, right? Acrobat can open the file, so why is your printer crashing? Ironically, some of those printers used an Adobe RIP in the toolchain, and this PDF->PS conversion on the printer was where things went wrong. (I once tracked down a crash where a font's glyph name definition in the dict was OK in PDF but invalid syntax in PS, because a // resolved into an immediately-evaluated name that doesn't exist.) But it's not something a technician could help with.
It was so bad that Ghostscript was just one of many tools: we'd throw a PDF through various toolchains and hope one of them saved it in a well-behaved form. Anyway, I'm almost sad I've moved on from that job, since I can't try this out on some real-world files. But in the end most of the issues came down to fonts, and to workflows that generate single-document PDFs and then merge them, resulting in things like 1000 nearly identical subset fonts consuming all the printer memory, so I'm not sure how much this would help.
I ended up with a fairly large set of shell scripts over Ghostscript to convert them into high-DPI TIFFs so we could print them reliably. It worked remarkably well, considering that one was open source and free and the other cost thousands per license.
I haven't worked on the innards of those machines but my suspicion is that it's a combination of 1) Not much RAM, to keep costs down, 2) An inability to handle a large number of resources i.e. no swapping out to slow storage on a least-recently-used principle or similar, and 3) extremely strict conformance to avoid surprises in output.
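The least-recently-used idea in (2) is easy to sketch. This is a hypothetical illustration in Python of how a RIP *could* evict resources under a memory budget, not how any actual printer firmware works:

```python
from collections import OrderedDict

class ResourceCache:
    """Tiny LRU cache: evict the least-recently-used resource
    (e.g. a subset font) when a fixed memory budget is exceeded."""

    def __init__(self, budget):
        self.budget = budget          # max total size in bytes
        self.used = 0
        self.entries = OrderedDict()  # name -> size, oldest first

    def load(self, name, size):
        if name in self.entries:
            # Cache hit: mark as most recently used.
            self.entries.move_to_end(name)
            return
        # Evict least-recently-used entries until the new one fits.
        while self.used + size > self.budget and self.entries:
            _, evicted_size = self.entries.popitem(last=False)
            self.used -= evicted_size
        self.entries[name] = size
        self.used += size

cache = ResourceCache(budget=100)
cache.load("FontA", 60)
cache.load("FontB", 60)  # FontA is evicted to make room
```

A printer that *can't* do this (point 2 above) simply errors out with a VMerror-style failure once the budget is exhausted, which matches the observed behavior with 1000 subset fonts.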
Kudos and thank you to those who maintain it and the associated packages!
Does anyone know of a collection of malformed PDF files? It would be useful for testing PDF processing programs.
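I don't know of a canonical corpus, but one cheap way to get started is to take a minimal PDF skeleton and mutate it randomly. A sketch in Python (the skeleton below deliberately omits the xref table, so even the baseline is already slightly malformed):

```python
import random

# A minimal single-page PDF skeleton as bytes (no xref table,
# so strict parsers may already reject it as-is).
MINIMAL_PDF = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>\nendobj\n"
    b"trailer\n<< /Size 4 /Root 1 0 R >>\n"
    b"%%EOF\n"
)

def mutate(data, n=5, seed=0):
    """Return a copy of `data` with n random single-byte corruptions."""
    rng = random.Random(seed)
    out = bytearray(data)
    for _ in range(n):
        out[rng.randrange(len(out))] = rng.randrange(256)
    return bytes(out)

# 100 deterministic, lightly-corrupted test inputs.
samples = [mutate(MINIMAL_PDF, n=5, seed=s) for s in range(100)]
```

Real fuzzers (AFL, libFuzzer) do this far more cleverly, with coverage feedback; this is only a starting point for a hand-rolled test suite.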
(note that the majority of them are relatively-harmless rendering issues but some PDFs here have caused crashes or even RCEs and process takeovers for certain malicious PDFs)
(But still, note: a couple of months ago I wrote a low-level PDF parser (just parse the PDF file's bytes into PDF objects, nothing more), fed it all the PDF files that happened to be on my laptop, and ran into files that some PDF viewers open but even qpdf doesn't. I say "even" because qpdf is really good, IMO.)
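For a sense of what "just parse the bytes into PDF objects" involves, here's a toy Python sketch of the lexing layer only. Real PDF lexing has many more token types (strings, streams, comments, escapes), so treat this strictly as an illustration:

```python
import re

# Toy tokenizer for a fragment of PDF syntax: dict/array delimiters,
# names, numbers, and a few keywords. Order of alternatives matters.
TOKEN = re.compile(rb"""
    (?P<dict_open><<) | (?P<dict_close>>>) |
    (?P<arr_open>\[)  | (?P<arr_close>\]) |
    (?P<name>/[^\s/<>\[\]()]*) |
    (?P<num>[+-]?\d+(?:\.\d+)?) |
    (?P<kw>obj|endobj|R|true|false|null)
""", re.VERBOSE)

def tokenize(data):
    """Yield (kind, text) pairs for recognizable tokens in `data`."""
    return [
        (m.lastgroup, m.group().decode("latin-1"))
        for m in TOKEN.finditer(data)
    ]

toks = tokenize(b"1 0 obj << /Type /Page /Parent 2 0 R >> endobj")
```

The painful part isn't this layer; it's deciding what to do when the bytes *almost* match the grammar, which is exactly where viewers and qpdf diverge.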
http://git.ghostscript.com/?p=tests.git;a=tree;f=pdf;h=2ce4f...
They're not all malformed, and they're mostly used for snapshot testing, but they cover a wide range of corner cases.
That said, this GitHub topic may have some pointers: https://github.com/topics/malware-samples
While progress is always nice to see - I am also pleased that we don't necessarily need to update all the scripts that depend on Ghostscript at once, but can keep them running in their current state.
Even if the application was fine, you would always encounter PS/PDF files in the wild that kept stress-testing the application's memory safety.
Isn't C, their chosen replacement for PostScript, also particularly bad at this?
It is however quite entertaining to read the predictable comments from Rust/Java/C++ fans who are upset that they didn't choose their favourite language.
They seem to be the kings of working with PDFs. I’ve not really looked at the Ghostscript code (and I’m surprised to hear their interpreter was still in PostScript), but I’ve looked through the mupdf code and what I saw was really nice.
In any case, I appreciate the work they’ve done in providing fantastic tools to the world for decades now.
James Gosling, inventor of Java, once described him as the "greatest programmer in the world". They both used to work at Sun Microsystems.
I wonder what made them decide to reimplement it instead of reusing their existing code.
AFAICT, it's roughly 30 people, mostly seniors.
> but I’ve looked through the mupdf code and what I saw was really nice.
It is! Best onboarding experience I've ever had.
I'm grinning widely when reading this.
Until last year I had an opportunity to help maintaining a pdf tools written using Golang. This case where a pdf doc that is not conforming with the standard could be opened in Acrobat but not on other pdf reader tools (including ghostscript) came a lot from our clients and I had to find a way to be able to read/extract the content with a minimum issue because of that.
PDF became such a weird mess that I’m not surprised PostScript is now (to a degree) just a subset of it, but writing an entirely new interpreter must have been a hefty chunk of work.
The post has no explanation of this choice. Does anyone know?
For all of C++’s faults, at least it’s possible to use a map (or unordered_set or whatever) and mostly avoid encoding the fact that it’s anything other than an associative container of some sort at the call sites. This is especially true in C++11 or newer with auto.
I don't understand this part of your comment. There's nothing preventing you from designing a nice well-encapsulated map/dictionary data structure in C and I'm sure there are many many libraries that do just that.
I do agree though that having such basic data structures in the standard library, as modern C++ does, is usually preferable.
I do agree that generics are required for modern programming, but for some, the cost of complexity of modern languages (compared to C) and the importance of compatibility seem to outweigh the benefits.
Requiring another skillset, toolchain, etc. is onerous and has to be weighed in those decisions. Rust is cool for sure, but difficult to adopt in brownfield projects because of humans more than tech.
Also, it wasn’t written in 2022; it was just made the default now. GS is a venerable codebase, and jumping on a “new” language bandwagon may have seemed dangerous at the time the work was started.
All conjecture. I’m not an expert or involved.
> The new PDF interpreter is written entirely in C, but interfaces to the same underlying graphics library as the existing PostScript interpreter. So operations in PDF should render exactly the same as they always have (this is affected slightly by differing numerical accuracy), all the same devices that are currently supported by the Ghostscript family, and any new ones in the future should work seamlessly.
[α] Languages targeting LLVM or supported by GCC are portable to every target machine code / ISA / architecture supported by those toolchains. JVM, JS, etc are portable to all the platforms they support. You don't need to do any extra work (of recompiling) if you use a bytecode VM / platform (for example, like JVM).
Not good!!
It still accumulates CVEs: https://www.sqlite.org/cves.html.
C is a well-tested, compact language - the fact that the Linux kernel, the BSD kernels, device drivers, and a whole lot of games and physics engines are written in it for performant systems is a testament to its reliability.
Additionally, I think it's the sane move. A language that's the hot cake today (yes, Rust) may or may not be in fashion 5 years from now, when there's a new hot cake. Choices are made with 10-15 years of project development in mind.
It also has a proven record that, no matter what, exploits are bound to happen, which is pushing the whole industry toward hardware memory tagging as the ultimate fix for C.