Convert potentially dangerous PDFs to safe PDFs

(github.com)

181 points · dp-hackernews · 4mo ago · 66 comments

coppsilgold 4mo ago

While useful, it needs a big red warning to potential leakers. If they were personally served documents (such as via email, while logged in, etc.), there really isn't much that can be done to ascertain the safety of leaking them. It's not even safe if there are two or more leakers and they "compare notes" to try and "clean" something for release.

https://en.wikipedia.org/wiki/Traitor_tracing#Watermarking

https://arxiv.org/abs/1111.3597

The watermark can even be contained in the wording itself (multiple versions of sentences, word choice, etc. store the entropy). The only moderately safe thing to leak would be a pure text full paraphrasing of the material. But that wouldn't inspire much trust as a source.
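For illustration, here is a minimal sketch of a word-choice watermark. The synonym slots, the template sentence, and the function names are all made up for this example; a real scheme would spread far more slots through a document.

```python
# Hypothetical synonym slots: each slot stores one bit of the recipient ID.
SLOTS = [
    ("begin", "start"),
    ("large", "big"),
    ("quickly", "rapidly"),
    ("shows", "demonstrates"),
]

# Hypothetical document text with one placeholder per slot.
TEMPLATE = "We {0} the review today. The {1} backlog grew {2}, which {3} the risk."


def embed(recipient_id: int) -> str:
    """Render the template, picking one synonym per slot from the ID's bits."""
    words = [pair[(recipient_id >> i) & 1] for i, pair in enumerate(SLOTS)]
    return TEMPLATE.format(*words)


def extract(text: str) -> int:
    """Recover the ID by checking which synonym appears for each slot."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    recipient_id = 0
    for i, (_zero_word, one_word) in enumerate(SLOTS):
        if one_word in tokens:
            recipient_id |= 1 << i
    return recipient_id
```

Rasterizing such a document to pixels and back leaves every word in place, so the ID survives the conversion.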

crazygringo 4mo ago

This doesn't seem to be designed for leakers, i.e. people sending PDFs -- it's specifically for people receiving untrusted files, e.g. journalists.

And specifically about them not being hacked by malicious code. I'm not seeing anything that suggests it's about trying to remove traces of a file's origin.

I don't see why it would need a warning for something it's not designed for at all.

coppsilgold 4mo ago

It would be natural for a leaker to assume that the PDF contains something "extra" and to try and remove it with this method. It may not occur to them that this something extra could be part of the content they are going to get back.

david_shaw 4mo ago

From the tool description linked:

> Dangerzone works like this: You give it a document that you don't know if you can trust (for example, an email attachment). Inside of a sandbox, Dangerzone converts the document to a PDF (if it isn't already one), and then converts the PDF into raw pixel data: a huge list of RGB color values for each page. Then, outside of the sandbox, Dangerzone takes this pixel data and converts it back into a PDF.

With this in mind, Dangerzone wouldn't even remove conventional watermarks (that inlay small amounts of text on the image).

I think the "freedomofpress" GitHub repo primed you to think about protecting someone leaking to journalists, but really it's designed to keep journalists (and other security-minded folk) safe from untrusted attachments.

The official website -- https://dangerzone.rocks/ -- is a lot more clear about exactly what the tool does. It removes malware, removes network requests, supports various filetypes, and is open source.

Their about page ( https://dangerzone.rocks/about/ ) shows common use cases for journalists and others.
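The pixel hand-off in the quoted description can be sketched roughly as follows. This is not Dangerzone's actual wire format: the 4-byte header, the `MAX_DIM` limit, and the function names are assumptions for illustration, and the sketch writes a PPM image instead of a PDF to stay within the standard library.

```python
import struct

MAX_DIM = 10_000  # arbitrary sanity limit chosen for this sketch


def pack_page(width: int, height: int, rgb: bytes) -> bytes:
    """Sandbox side: emit a two-uint16 header followed by raw RGB bytes."""
    if len(rgb) != width * height * 3:
        raise ValueError("pixel buffer does not match dimensions")
    return struct.pack(">HH", width, height) + rgb


def unpack_page(blob: bytes) -> tuple[int, int, bytes]:
    """Host side: accept only a header plus exactly width*height*3 bytes."""
    if len(blob) < 4:
        raise ValueError("truncated header")
    width, height = struct.unpack(">HH", blob[:4])
    if not (0 < width <= MAX_DIM and 0 < height <= MAX_DIM):
        raise ValueError("implausible page size")
    rgb = blob[4:]
    if len(rgb) != width * height * 3:
        raise ValueError("unexpected amount of pixel data")
    return width, height, rgb


def to_ppm(width: int, height: int, rgb: bytes) -> bytes:
    """Wrap validated pixels in a binary PPM (P6) image."""
    return f"P6\n{width} {height}\n255\n".encode("ascii") + rgb
```

The point of the design is that the host never parses the original document, only two integers and a byte buffer of a known length.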

3eb7988a1663 4mo ago

Canary traps have been popularized in a few works of fiction. Seems trivial to do in the modern era. The sophisticated version I heard is to make the differences in the white space between individual words/lines/wherever.

[0] https://en.wikipedia.org/wiki/Canary_trap

nextaccountic 4mo ago

What about having PDF readers have a restricted mode where it has limited capabilities, with the option to "trust" a PDF to enable potentially dangerous features?

Like IDEs do when you open random projects

alphazard 4mo ago

I seem to remember Yahoo finance (I think it was them, maybe someone else) introducing benign errors into their market data feeds, to prevent scraping. This led to people doing 3 requests instead of just 1, to correct the errors, which was very expensive for them, so they turned it off.

I don't think watermarking is a winning game for the watermarker, with enough copies any errors can be cancelled.

coppsilgold 4mo ago

> I don't think watermarking is a winning game for the watermarker, with enough copies any errors can be cancelled.

This is a very common assumption that turns out to be false.

There are Tardos probabilistic codes (see the paper I linked) which have the watermark scale as the square of the traitor count.

For example, with a watermark of just 400 bits, 4 traitors (who try their best to corrupt the watermark) will stand out enough to merit investigation and with 800 bits be accused without any doubt. This is for a binary alphabet, with text you can generate a bigger alphabet and have shorter watermarks.

These are typically intended for tracing pirated content, so they carry the so-called Marking Assumption: if given two or more versions of a piece of content, you must choose one. (A pirate isn't going to corrupt or remove a piece of video, since that would make it unsuitable for leaking.) So it would likely be possible to get better results with documents, though it may require larger watermarks to catch such traitors reliably.
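To make the idea concrete, here is a simplified sketch of a Tardos-style code with the symmetric accusation score. It is not the exact construction from the linked paper: the cutoff, the majority-vote collusion strategy, and the function names are assumptions chosen for illustration.

```python
import math
import random


def make_biases(length: int, cutoff: float, rng: random.Random) -> list[float]:
    """Draw per-position biases p_i from the arcsine distribution on [cutoff, 1 - cutoff]."""
    lo = math.asin(math.sqrt(cutoff))
    hi = math.pi / 2 - lo
    return [math.sin(rng.uniform(lo, hi)) ** 2 for _ in range(length)]


def make_codeword(biases: list[float], rng: random.Random) -> list[int]:
    """Each user gets bit 1 at position i with probability p_i."""
    return [1 if rng.random() < p else 0 for p in biases]


def majority_forgery(copies: list[list[int]]) -> list[int]:
    """One collusion strategy: majority vote per position (ties go to 1)."""
    return [1 if 2 * sum(bits) >= len(bits) else 0 for bits in zip(*copies)]


def score(codeword: list[int], forgery: list[int], biases: list[float]) -> float:
    """Symmetric accusation score: agreeing with the forgery on a rare bit counts most."""
    total = 0.0
    for x, y, p in zip(codeword, forgery, biases):
        weight_one = math.sqrt((1 - p) / p)   # magnitude used when the user holds a 1
        weight_zero = math.sqrt(p / (1 - p))  # magnitude used when the user holds a 0
        if y == 1:
            total += weight_one if x == 1 else -weight_zero
        else:
            total += weight_zero if x == 0 else -weight_one
    return total
```

An innocent user's score has mean zero and grows only like the square root of the code length, while colluders' scores grow linearly with it. That gap is why blending copies does not cancel the mark.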

alphazard 4mo ago

This was a fascinating read, thanks for posting.

I'm not totally convinced that the threat model is realistic. The watermarker has to embed the watermark, the only place to do that is in the least significant bits of whatever the message is. If it's an audio file then the least significant bits of each sample would work. If it's a video file then the LSBs in a DCT bin may also be unnoticeable. It can really only go in certain places, without it affecting the content in a meaningful way. If it's in a header, or separate known location, then the pirate can just delete those bits.

The threat model presented says the pirates have to go with one of the copies, or only correct errors that are different between 2 copies. That's the part that I don't think is realistic. If the pirates knew that the file was marked, and the scheme used to mark it, but didn't know the key (a standard threat model for things like encryption), then they could inject their own noise into wherever the watermark could be hiding, and now the problem is the watermarker trying to send a message on a noisy channel, where the pirates have a jammer. I don't even think you have to sacrifice quality, since the copy you have already has noise, and you just need to inject the same amount (or more).
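The jamming idea described above can be sketched as follows, assuming the watermark lives only in the least significant bit of non-negative integer samples. That is a simplification, and the function names are hypothetical.

```python
import random


def jam_lsb(samples: list[int], rng: random.Random) -> list[int]:
    """Replace the least significant bit of every sample with a random bit."""
    return [(s & ~1) | rng.getrandbits(1) for s in samples]


def read_lsb(samples: list[int]) -> list[int]:
    """Read back whatever currently sits in the least significant bits."""
    return [s & 1 for s in samples]
```

Each sample changes by at most 1, so perceived quality is roughly preserved while about half of the hidden bits flip. Whether a real watermark is confined to positions a pirate can locate is exactly the point under debate.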

apyrgiotis 4mo ago

Oof, that's a great point. We briefly touched on this a few weeks ago, but from the angle of canary tokens / tracking pixels [1].

Security-wise, our main concern is protecting people who read suspicious documents, such as journalists and activists, but we do have sources/leakers in our threat model as well. Our docs are lacking in this regard, but we will update them with information targeted specifically to non-technical sources/leakers about the following threats:

- Metadata (simple/deep)

- Redactions (surprisingly easy to get wrong)

- Physical watermarking (e.g., printer tracking dots)

- Digital watermarking (what you're pointing out here)

- Fingerprinting (camera, audio, stylometry)

- Canary tokens (not metadata per se, but still a de-anonymization vector)

If you come to FOSDEM next week, we plan to talk about this subject there [2].

The goal here isn't to provide a false sense of security, nor frighten people. It's plain old harm reduction. We know (and encourage) sources to share documents that can help get a story out, but we also want to educate them about the circumstances in which they may contain their PII, so that they can make an informed choice.

[1]: https://social.freedom.press/@dangerzone/115859839710582670

[2]: https://fosdem.org/2026/schedule/event/JZ3F8W-dangerzone_ble...

(Dangerzone dev btw)

robertk 4mo ago

Why not leak a dataset of N full text paraphrasings of the material, together with a zero-knowledge proof of how to take one of the paraphrasings and specifically "adjust" it to the real document (revealed in private to trusted asking parties)? Then the leaker can prove they released "at least the one true leak" without incriminating themselves. There is a cryptographic solution to this issue.

rtkwe 4mo ago

Wouldn't comparing two downloads immediately reveal whether the files are watermarked, though? Sentence-level or other steganographic watermarks embedded in the text itself, especially, should show up pretty clearly in a simple comparison.
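A simple comparison of that kind might look like the sketch below, which uses `difflib` from the Python standard library. The function name is hypothetical, and because it splits on whitespace it would miss marks hidden in spacing; those would need a character-level diff.

```python
import difflib


def differing_words(copy_a: str, copy_b: str) -> list[tuple[str, str]]:
    """Return (a_fragment, b_fragment) pairs where two copies disagree."""
    a_words, b_words = copy_a.split(), copy_b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    return [
        (" ".join(a_words[i1:i2]), " ".join(b_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

This only exposes positions where the two copies differ. Any marked position where both copies happen to carry the same variant stays invisible to the comparison.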

normie3000 4mo ago

> The only moderately safe thing to leak would be a pure text full paraphrasing of the material. But that wouldn't inspire much trust as a source.

Isn't this what newspapers do?

jevinskie 4mo ago

This seems like a similar but less elegant solution compared to parsing and normalizing to a “safe” subset, rather than just blasting it to pixels.

https://github.com/caradoc-org/caradoc

http://spw16.langsec.org/slides/guillaume-endignoux-slides.p...

chaps 4mo ago

Heh, I've seen this a bunch of times and it's of interest to me, but honestly? It's sooooo limiting by being an interface without a complementary command line tool. Like, I'd like to put this into some workflows but it doesn't really make sense to without using something like pyautogui. But maybe I'm missing something hidden in the documentation.

tclancy 4mo ago

https://github.com/freedomofpress/dangerzone/blob/main/dange...

How hard did you look the other times?

chaps 4mo ago

Not much further than their documentation, friend! But thanks for finding that, that's actually super helpful! I hope somebody puts in a pr for updating the documentation to make it clear what functionality their tool has.

tclancy 4mo ago

I can show you the link for how to do that too if needed.

crazygringo 4mo ago

It seems to be meant for end-users like journalists processing files individually, like e-mail attachments.

It doesn't seem to be meant for usage at scale -- it's not for general-purpose conversion, as the resulting files are huge, will have OCR errors, etc.

chaps 4mo ago

I'm the target audience for this sort of tool. :)

almet 4mo ago

(Hi, dangerzone maintainer here)

There is indeed a dangerzone-cli tool¹, and it should be made more visible. We plan on updating/consolidating our docs in the foreseeable future, to make things clearer.

Also, plans are here to make it possible to use dangerzone as a library, which should help use cases like the one you mention.

¹ https://github.com/freedomofpress/dangerzone/blob/main/dange...

chaps 4mo ago

Incredible, thanks for sharing! Can't wait to use it for my pdf pipelines :)

mike_d 4mo ago

Shameless self promotion: preview.ninja is a site I built that does this and supports 300+ file formats. I'm currently weekend coding version 2.0 which will support 500+ formats and allow direct data extraction in addition to safe viewing.

It is a passion project and will always be free because commercial CDR[1] solutions are insanely expensive and everyone should have access to the tools to compute securely.

1. https://en.wikipedia.org/wiki/Content_Disarm_%26_Reconstruct...

gu009 4mo ago

A handy side use for this is compressing PDFs.

For some reason, printing 1 page of an Excel or Word document to a PDF often gets up to around 4MB in size. Passing it through this compresses it quite well.

Just ran a quick test:

- 1-page Excel PDF export: 3.7MB

- Processing with Dangerzone (OCR enabled): 131KB

mikepurvis 4mo ago

I wonder if the Excel export is retaining a lot of document structure in the event that it's imported back into Excel again at a later point.

vee-kay 4mo ago

Fun trivia: XLSX, DOCX, and PPTX files are just ZIP archives of XML files. You can rename them to a ".zip" file extension, extract them, and open the XML files inside in Notepad to see their raw contents.

But you can use qpdf or PDFEdit to interpret a PDF's raw code.

https://stackoverflow.com/a/6562443

And thus, you can compare the raw XLSX (XML) vs raw PDF.
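Since OOXML files are ZIP containers of XML parts, a minimal sketch for listing those parts with Python's standard `zipfile` module could look like this. The function name is hypothetical.

```python
import io
import zipfile


def list_xml_parts(ooxml_bytes: bytes) -> list[str]:
    """Return the names of the XML parts inside an OOXML (XLSX/DOCX/PPTX) package."""
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as package:
        return sorted(name for name in package.namelist() if name.endswith(".xml"))
```

Reading a spreadsheet's bytes from disk and passing them in gives the list of parts to compare against the PDF export.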

crazygringo 4mo ago

I don't know if I would do that.

The size is probably font embedding.

And then the OCR will probably not be 100% correct if you ever intend to copy-paste from it.

dfajgljsldkjag 4mo ago

I personally just upload them to google drive. It would be a serious pwn if they could somehow still do a compromise through google drive.

bob1029 4mo ago

Does google drive apply any transformation over the PDF, or are you effectively loading the same document in your browser on the round trip?

venusenvy47 4mo ago

I often view PDFs in Drive, and it's definitely not just displaying the document with the native web browser. It is rendered with their "Drive renderer", whatever that is. They don't even display a simple .txt file natively in the browser.

Gigachad 4mo ago

They have some kind of virus scanner for files you open via a share link. Not sure about the ones you have stored on your own drive unshared.

But probably the main security here is just using the chrome pdf viewer instead of the adobe one. Which you can do without google drive. The browser PDF viewers ignore all the strange and risky parts of the PDF spec that would likely be exploited.

autoexec 4mo ago

And yet browser PDF viewers still have vulnerabilities and hackers keep finding sandbox escapes.

creatonez 4mo ago

Firefox has a built-in PDF reader, PDF.js, that resides inside of the JavaScript sandbox. In theory, it's as safe as loading a webpage.

autoexec 4mo ago

So not actually all that safe, since sandbox escapes happen all the time. PDF.js has had many vulnerabilities as well.

gleenn 4mo ago

Do you have any specifics on what Drive does? Any examples of it fixing embedded viruses? Or is this a blind assumption?

akersten 4mo ago

I assume they mean "upload to drive and use the web based reader to view the PDF," not "upload to drive and download it again"

gleenn 4mo ago

And what special sauce does the web preview use? At some point, someone has to actually parse and process the data. I feel like on a tech site like Hacker News, speculating that Google has somehow done a perfect job of preventing malicious PDFs begs the question: how do you actually do that and prove that it's safe? And is that even possible in perpetuity?

apyrgiotis 4mo ago

(disclaimer: one of the Dangerzone devs)

That's something I do from time to time as well. AFAIK Google Drive renders all documents server-side (which implicitly means that they don't trust the browser sandbox), so giving up some privacy is a reasonable price to pay.

Dealing with sensitive documents though is another story, you just can't upload them to a third-party service. That's where projects like Dangerzone come into play.

PaulDavisThe1st 4mo ago

Is there some reason why just viewing the PDF with a FLOSS, limited PDF viewer (e.g. atril) would not accomplish the same level of safety? What can a "dangerous PDF" do inside atril?

philipkglass 4mo ago

It looks like atril is mostly written in C:

https://github.com/mate-desktop/atril

A crafted PDF can potentially exploit a bug in atril to compromise the recipient's computer since writing memory-safe C is difficult. This approach was famously used by a malware vendor to exploit iMessage through a compressed image format that's part of the PDF standard:

https://projectzero.google/2021/12/a-deep-dive-into-nso-zero...

capitainenemo 4mo ago

This is why Firefox chose to implement a custom PDF reader in pure JS for better sandboxing leveraging the existing browser JS sandboxing. As a side effect, it's been a helpful JS library for embedding PDFs on websites.

The Chrome PDF parser, originating from Foxit (now open-sourced as PDFium), has been the source of many exploits in Chrome itself over the years.

almet 4mo ago

(Hi, disclaimer: I'm one of the current dangerzone maintainers)

That's a good question :-)

Opening PDFs, or images, or any other document directly inside your machine, even with a limited PDF viewer, potentially exposes your environment to this document.

The reason is that exploits in the image/font/docs parsing/rendering libraries can happen and are exploited in the wild. These exploits make it possible for an attacker to access the memory of the host, and in the worst case allow code execution.

Actually, that's the very threat Dangerzone is designed to protect you from.

We do that by doing the docs to pixel conversion inside a hardened container that uses gVisor to reduce the attack surface ¹

One other way to think about it is to actually consider document rendering unsafe. The approach Dangerzone is taking is to make sure the environment doing the conversion is as unprivileged as possible.

In practice, an attack is still possible, but much more costly: an attacker will be required to do a container escape or find a bug in the Linux kernel/gVisor in addition to finding an exploit in document rendering tools.

Not impossible, but multiple times more difficult.

¹ We covered that in more detail in this article: https://dangerzone.rocks/news/2024-09-23-gvisor/

zigzag312 4mo ago

> The reason is that exploits in the image/font/docs parsing/rendering libraries can happen and are exploited in the wild.

Aren't risks similar when opening any untrusted web page in a browser?

The only difference is that browser sandbox and exploit mitigations are probably better than that of a PDF viewer.

majkinetor 4mo ago

Is there any benefit of this tool over opening docs in Windows Sandbox/VM with disabled network? Conversion can be easily done with a simple tool that screenshots each page within the sandbox (could be done, for example, with a few lines of AHK script).

robertk 4mo ago

Why not just open it inside of and print to a static image output within a fully sandboxed Docker container?

almet 4mo ago

(Hi, disclaimer: I'm one of the current dangerzone maintainers)

You are correct: that's basically what Dangerzone is doing!

The challenges for us are to have a sandbox that stays secure and to make it possible for non-tech folks (e.g. journalists) to run this on their machines easily.

About the sandbox:

- Making sure that it's still updated requires some work: that's testing new container images, and having a way to distribute them securely to the host machines ;

- In addition to running in a container, we reduce the attack surface by using gVisor¹ ;

- We pass a few flags to the Docker/Podman invocation, effectively blocking network access and reducing the authorized system calls ;

Also, in our case the sandbox doesn't mount the host filesystem in any way, and we're streaming back pixels, which will then be written to a PDF by the host (we're also currently considering adding the option to write back images instead).

The other part of the work is to make that easily accessible to non-tech folks. That means packaging Podman on macOS/Windows, and providing an interface that works on all major OSes.

¹ https://dangerzone.rocks/news/2024-09-23-gvisor/
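As a rough illustration of the kind of flags mentioned above, here is a sketch that builds a locked-down Podman/Docker command line. These are standard `run` options, but the selection is an assumption and not Dangerzone's actual invocation. The function name is hypothetical, and the image and command passed in by a caller would be placeholders.

```python
def sandbox_argv(image: str, command: list[str], runtime: str = "podman") -> list[str]:
    """Build a container invocation with no network, no capabilities, and no host mounts."""
    return [
        runtime, "run",
        "--rm",                              # discard the container afterwards
        "--interactive",                     # document in via stdin, pixels out via stdout
        "--network=none",                    # no network access
        "--cap-drop=all",                    # drop all Linux capabilities
        "--security-opt=no-new-privileges",  # block privilege escalation via setuid binaries
        "--read-only",                       # read-only root filesystem
        image, *command,
    ]
```

Restricting system calls would additionally need a seccomp profile passed through `--security-opt seccomp=...`, which is left out here because the profile itself is deployment-specific.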

e40 4mo ago

Why not upload to Google docs and view there? Way less work.

prmoustache 4mo ago

You might not want to make this file, or the fact that you are in possession of this file, known to law enforcement.

e40 4mo ago

Certainly, but that's what, like .0001% of PDFs people encounter?

autoexec 4mo ago

Yep. A static image would be better, although I'd also prefer the option of getting a simple text file so that I can get the URLs out of hyperlinks.

tosti 4mo ago

I'd rather have 2 minimal (headless, no network, etc) virtual machines. One runs pandoc for the conversion and the other runs ghostscript on the result. Nowadays you can let a web browser run pretty much anything so you don't need to build a vm image anymore.

daft_pink 4mo ago

Could we make a method to sanitize PDFs that preserves the metadata?

It would be better to strip active content like JavaScript and actions without flattening the PDF and losing all the text data. Having the original text is better than sending it through OCR again.
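A first step toward that approach is finding the active content. The sketch below only detects it and does not strip anything: it counts a few PDF name tokens commonly tied to scripts and automatic actions in the raw bytes. The token list and function name are assumptions for illustration, and the scan will miss names written with `#xx` hex escapes or objects stored inside compressed streams.

```python
import re

# PDF name tokens commonly associated with active content (illustrative, not exhaustive).
RISKY_NAMES = ("JavaScript", "JS", "OpenAction", "AA", "Launch")


def find_active_content(pdf_bytes: bytes) -> dict[str, int]:
    """Count occurrences of risky name tokens in the raw bytes of a PDF."""
    counts = {}
    for name in RISKY_NAMES:
        # The lookahead stops "/JS" from matching inside a longer name.
        pattern = re.compile(rb"/" + name.encode("ascii") + rb"(?![A-Za-z0-9])")
        counts[name] = len(pattern.findall(pdf_bytes))
    return counts
```

Actually removing those entries while keeping the text layer requires a real PDF parser, which brings back the parsing risk that the pixel-based approach avoids.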

nullc 4mo ago

To review documents received from a hostile and dishonest actor in litigation I used disposable VMs in qubes on a computer with a one way (in only) network connection[1], while running the tools (e.g. evince) in valgrind and with another terminal watching attempted network traffic (an approach that did detect attempted network callbacks from some documents but I don't think any were PDFs).

This would have been useful-- but I think I would have layered it on top of other isolation.

([1] constructed from a media converter pair, a fiber splitter to bring the link up on the tx side, and some off the shelf software for multicast file distribution).

snowmobile 4mo ago

It's a neat program, but what's the use for JPGs and PNGs?

boston_clone 4mo ago

There are some neat detection bypass / compromise methods using various image formats, including PNG [0] and SVG [1]!

I imagine that folks like journalists could have that type of attack in their threat model, and EFF already do a lot of great stuff in this space :)

0. https://isc.sans.edu/diary/31998

1. https://www.cloudflare.com/cloudforce-one/research/svgs-the-...

anthk 4mo ago

Why not DJVU with a high DPI instead of a PDF?

rurban 4mo ago

Now teach this to HR departments. They still ask for Word docs or PDFs from untrusted people. ASCII text is frowned upon. Go figure.

The employment readiness check: can you trust a company?
