Claude Opus came up with this script:
It produces a somewhat-readable PDF (first page at least) with this text output:
(I used the cleaned output at https://pastebin.com/UXRAJdKJ mentioned in a comment by Joe on the blog page)
https://www.mountsinai.org/about/newsroom/2012/dubin-breast-...
https://www.businessinsider.com/dubin-breast-center-benefit-...
Even names match up, but oddly the date is different.
She's a medical doctor who became amnesic while on the stand in Maxwell's case:
>Pressed about gaps in her memory, Dubin told the court: "It's very hard for me to remember anything far back and sometimes I can't remember things from last month. My family notices it. I notice it."
Any chance you could share a screenshot / re-export it as a (normalized) PDF? I’m curious about what’s in there, but all of my readers refuse to open it.
which uses this Rust zlib stream fixer: https://pastebin.com/iy69HWXC
and gives the best output I've seen it produce: https://imgur.com/itYWblh
This is using the same OCR'd text posted by commenter Joe.
Xerox would like a word.
https://news.ycombinator.com/item?id=29223815
Point being, "correcting" to "correct looking" may be worse than just accepting errors. Errors are often clearly identified by humans as a nonsense word. "Correcting" OCR can result in plausible, but wrong results that are more difficult for the human in the loop to identify.
1. Get an open source pdf decoder
2. Decode bytes up to first ambiguous char
3. See if the next bits are valid with a 1; if not, it's an l
4. Might need to backtrack if both 1 and l were valid
By being able to quickly try each char in the middle of the decoding process, you avoid restarting from the beginning each time. This makes it feasible to test all permutations automatically and linearly.
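The steps above can be sketched as a small backtracking search. This is a minimal illustration, not the article's actual tool: it assumes the attachment body is a zlib stream carried as base64 text, and uses a hypothetical `?` marker for each OCR'd glyph that could be either `1` or `l`. Branches whose decoded prefix the streaming decompressor already rejects are pruned immediately, and a fully resolved candidate is accepted only if the whole stream decompresses.

```python
import base64
import zlib

AMBIG = "?"  # hypothetical marker for an OCR glyph that could be '1' or 'l'

def search(b64: str, resolved: str = ""):
    """Backtracking 1-vs-l resolver (sketch): prune branches whose
    base64 -> zlib prefix the streaming decompressor rejects."""
    i = len(resolved)
    while i < len(b64) and b64[i] != AMBIG:  # copy the unambiguous run
        resolved += b64[i]
        i += 1
    if i == len(b64):
        # Fully resolved: accept only if the entire stream decompresses,
        # which also exercises zlib's trailing Adler-32 check.
        try:
            zlib.decompress(base64.b64decode(resolved))
            return resolved
        except zlib.error:
            return None
    for c in "1l":  # try both readings of the ambiguous glyph
        attempt = resolved + c
        # Probe only complete 4-char base64 quanta of the prefix.
        prefix = attempt[: len(attempt) // 4 * 4]
        try:
            zlib.decompressobj().decompress(base64.b64decode(prefix))
        except zlib.error:
            continue  # prefix already invalid: backtrack
        result = search(b64, attempt)
        if result is not None:
            return result
    return None
```

Because a correct prefix of a valid zlib stream never errors, the true resolution is always reachable; wrong resolutions are almost always caught either mid-stream or by the final checksum.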
76 pages is a couple of months of work
I consider myself fairly normal in this regard, but I don't have 76 friends to ask to do this, so I don't know how I'd go about doing this. Post an ad on craigslist? Fiverr? Seems like a lot to manage.
Also look up double/triple data-entry systems, where you have multiple people enter the data and then flag and resolve differences. Won't protect you from your staff banding together to fuck you over with maliciously bad data, but it's incredibly effective at ensuring people were Actually Working Their Blocks under healthy circumstances.
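The reconcile step can be sketched in a few lines, assuming equal-length transcriptions and position-by-position comparison (a simplification of real double-entry systems, which also handle insertions and deletions):

```python
from collections import Counter

def reconcile(entries):
    """Multi-entry reconciliation sketch: for each character position,
    keep the value if a strict majority of typists agree, otherwise
    flag it as None for manual review."""
    result = []
    for i, chars in enumerate(zip(*entries)):
        (best, votes), = Counter(chars).most_common(1)
        result.append((i, best if votes > len(entries) // 2 else None))
    return result
```

With three typists, any single mistyped character is outvoted; with two, every disagreement gets flagged for a human to resolve.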
Unlike every other PDF alternative that has been attempted, the federal government doesn't have to worry about adoption.
DjVu [1] would be another option. It has really good open source tooling available, but it supports substantially fewer features than PDF, making it not really suitable as a drop-in replacement. The format is relatively simple though, so redaction should be fairly doable.
TIFF [2] is already occasionally used for government documents, but it's arguably more complex than PDF, so probably not a good choice for this.
It’s not a tools problem, it’s a problem of malicious compliance and contempt for the law.
For example, when the Mueller reports were released with redactions, they had no searchable text or metadata because of worries about exactly this kind of data leak.
However, vast troves of unsearchable text are not a huge win for transparency.
PDFs are just a garbage format and even good administrations struggle.
The copy linked in the post:
https://www.justice.gov/epstein/files/DataSet%209/EFTA004004...
Three more copies:
https://www.justice.gov/epstein/files/DataSet%2010/EFTA02153...
https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...
https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...
Having several different versions might make it easier.
https://www.justice.gov/epstein/files/DataSet%209/EFTA007755...
This doesn't solve the "1 & l" problem for the pdf you are looking at, but it could be useful anyway.
https://www.justice.gov/epstein/files/DataSet%2011/EFTA02702...
I'm lucky to have parents with strong values. My whole life they've given me advice, on the small stuff and the big decisions. I didn't always want to hear it when I was younger, but now in my late thirties, I'm really glad they kept sharing it. In hindsight I can see the life experience and wisdom in it, and how it's helped and shaped me.
Hmm. Anyone got some spare CPU time?
Someone who made some progress on one Base64 attachment got some XMP metadata that suggested a photo from an iPhone. Now I don't know if that photo was itself embedded in a PDF, but perhaps getting at least the first few hundred bytes decoded (even if it had to be done manually) would hint at the file-type of the attachment. Then you could run your tests for file fidelity.
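Sniffing the file type from a short recovered prefix is straightforward, since most container formats start with well-known magic bytes. A small sketch (the signature table below uses the commonly published magic numbers; extend as needed):

```python
import base64

# Common file signatures (magic numbers) for sniffing a recovered prefix.
MAGIC = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF-": "PDF",
    b"PK\x03\x04": "ZIP (also docx/xlsx)",
}

def sniff(b64_prefix: str) -> str:
    """Decode the first few base64 chars and match known signatures."""
    # Truncate to a multiple of 4 so b64decode accepts the fragment.
    data = base64.b64decode(b64_prefix[: len(b64_prefix) // 4 * 4])
    for sig, name in MAGIC.items():
        if data.startswith(sig):
            return name
    return "unknown"
```

An iPhone photo with XMP metadata would most likely show up as JPEG here, so even a few hundred clean leading characters should settle the question.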
That's pointed out in the article. It's easy for plaintext sections, but not for compressed sections. I didn't notice any mention of checksums.
Followup: pdfimages is 13x faster than pdftoppm
Or worse. She did.
More likely it's just an oversight, but it could also be CYA for dragging their feet, like "you rushed us, and look at these victims you've retraumatized". There are software solutions to find nudity and they're quite effective.
Page 1: https://imgur.com/a/jwgu9uH
Page 2: https://imgur.com/a/4Zi3bkk
Use this: https://github.com/KoKuToru/extract_attachment_EFTA00400459
The recipient is also named in there...
The search on the DOJ website (which we shouldn't trust), given the query "Content-Type: application/pdf; name=", yields maybe a half dozen or so similarly printed Base64 attachments.
There's probably lots of images as well attached in the same way (probably mostly junk). I deleted all my archived copies recently once I learned about how not-quite-redacted they were. I will leave that exercise to someone else.
I had a reasonably simple problem to solve: a slightly weird font and some ten words of English (I was actually only missing one or two blocks of letters to cover everything I needed).
After a couple of days, having almost everything (?), I just surrendered. This seems to be intentionally hostile: all the docs scattered across several repositories, no comprehensive examples, etc.
Absolutely awful piece of software from this end (training the last gen).
I tried to find the message in this blog post, but couldn't (I don't see how to search by date).
Incompetence is incompetence.
It's really really hard to give them the benefit of the doubt at this point.
They wasted months erasing Trump from that instead. So it's on them.
A dynamic programming type approach might still be helpful. One reading of the character might produce invalid flate data while the other stays valid, or might give an implausible result.
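Both signals can be checked cheaply. A minimal sketch, assuming the candidate bytes are a zlib stream and that the payload is expected to be mostly text (the printable-byte heuristic is an assumption, not something from the article):

```python
import zlib

def score_candidate(candidate: bytes):
    """Validity probe for one 1-vs-l resolution (sketch): returns None if
    the flate data is rejected outright, otherwise the fraction of
    printable bytes in the decompressed output, as a rough plausibility
    score for text payloads."""
    try:
        out = zlib.decompressobj().decompress(candidate)
    except zlib.error:
        return None  # invalid deflate data: this branch can be discarded
    if not out:
        return 0.0
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in out)
    return printable / len(out)
```

A hard `None` rejects a branch outright; among surviving branches, the higher-scoring one is the more plausible reading.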
https://www.justice.gov/epstein/files/DataSet%2010/EFTA01804...
https://www.justice.gov/epstein/files/DataSet%209/EFTA007755...
https://www.justice.gov/epstein/files/DataSet%209/EFTA004349...
and then this one, judging by the name of the file (Hanna something) and the content of the email:
"Here is my girl, sweet sparkling Hanna=E2=80=A6! I am sure she is on Skype "
Maybe more sinister (so be careful; I have no idea what the laws are if you uncover you-know-what Trump and Epstein were into)...
https://www.justice.gov/epstein/files/DataSet%2011/EFTA02715...
[Above is probably a legit modeling CV for HANNA BOUVENG, based on, https://www.justice.gov/epstein/files/DataSet%209/EFTA011204..., but still creepy, and doesn't seem like there's evidence of her being a victim]
I tried and got a lot of errors; can't seem to fix it, due to corruption.
https://www.docfly.com/editor/fa3bcb1fa9e8d2629b32/v9r21qsju...
Tried to get AI to guess the remaining text: https://pastebin.com/Z9X2d510
Anyway searching for the email sender's name, there's a screenshot of an email of hers in English offering him a girl as an assistant who is "in top physical shape" (probably not this Hanna girl). That's fucking creepy: https://www.expressen.se/nyheter/varlden/epsteins-lofte-till...
Wonder why there are so many random case files in the files.
Cool article, however.
PDF is basically a prettified layer on top of the older PostScript that brings a lot of baggage. The moment you start trying to do what should be simple stuff, like editing lines, merging pages, or changing the resolution of images, it starts giving you a lot of headaches.
I used to have a few scripts around to fight some of its quirks from when I was writing my thesis and had to work daily with it. But well, it was still an improvement over Word.