Claude Opus came up with this script:
It produces a somewhat-readable PDF (first page at least) with this text output:
(I used the cleaned output at https://pastebin.com/UXRAJdKJ mentioned in a comment by Joe on the blog page)
https://www.mountsinai.org/about/newsroom/2012/dubin-breast-...
https://www.businessinsider.com/dubin-breast-center-benefit-...
Even names match up, but oddly the date is different.
She's a medical doctor who became amnesic while on the stand in Maxwell's case:
>Pressed about gaps in her memory, Dubin told the court: "It's very hard for me to remember anything far back and sometimes I can't remember things from last month. My family notices it. I notice it."
Any chance you could share a screenshot / re-export it as a (normalized) PDF? I’m curious about what’s in there, but all of my readers refuse to open it.
which uses this Rust zlib stream fixer: https://pastebin.com/iy69HWXC
and gives the best output I've seen it produce: https://imgur.com/itYWblh
This is using the same OCR'd text posted by commenter Joe.
Xerox would like a word.
https://news.ycombinator.com/item?id=29223815
Point being, "correcting" to "correct looking" may be worse than just accepting errors. Errors are often clearly identified by humans as a nonsense word. "Correcting" OCR can result in plausible, but wrong results that are more difficult for the human in the loop to identify.
1. Get an open source pdf decoder
2. Decode bytes up to first ambiguous char
3. See if the next bits are valid with a 1; if not, it's an l
4. Might need to backtrack if both 1 and l were valid
By being able to quickly try each char in the middle of the decoding process, you avoid restarting from the beginning each time. This makes it feasible to test all permutations automatically and linearly.
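The steps above can be sketched as a small backtracking search. This is a minimal illustration, not the article's actual tool: it assumes the attachment body is a zlib stream carried as base64 text, and uses a hypothetical `?` marker for each OCR'd glyph that could be either `1` or `l`. Branches whose decoded prefix the streaming decompressor already rejects are pruned immediately, and a fully resolved candidate is accepted only if the whole stream decompresses.

```python
import base64
import zlib

AMBIG = "?"  # hypothetical marker for an OCR glyph that could be '1' or 'l'

def search(b64: str, resolved: str = ""):
    """Backtracking 1-vs-l resolver (sketch): prune branches whose
    base64 -> zlib prefix the streaming decompressor rejects."""
    i = len(resolved)
    while i < len(b64) and b64[i] != AMBIG:  # copy the unambiguous run
        resolved += b64[i]
        i += 1
    if i == len(b64):
        # Fully resolved: accept only if the entire stream decompresses,
        # which also exercises zlib's trailing Adler-32 check.
        try:
            zlib.decompress(base64.b64decode(resolved))
            return resolved
        except zlib.error:
            return None
    for c in "1l":  # try both readings of the ambiguous glyph
        attempt = resolved + c
        # Probe only complete 4-char base64 quanta of the prefix.
        prefix = attempt[: len(attempt) // 4 * 4]
        try:
            zlib.decompressobj().decompress(base64.b64decode(prefix))
        except zlib.error:
            continue  # prefix already invalid: backtrack
        result = search(b64, attempt)
        if result is not None:
            return result
    return None
```

Because a correct prefix of a valid zlib stream never errors, the true resolution is always reachable; wrong resolutions are almost always caught either mid-stream or by the final checksum.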
76 pages is a couple of months of work
I consider myself fairly normal in this regard, but I don't have 76 friends to ask to do this, so I don't know how I'd go about doing this. Post an ad on craigslist? Fiverr? Seems like a lot to manage.
Also look up double/triple data-entry systems, where you have multiple people enter the data and then flag and resolve differences. Won't protect you from your staff banding together to fuck you over with maliciously bad data, but it's incredibly effective at ensuring people were Actually Working Their Blocks under healthy circumstances.
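The reconcile step can be sketched in a few lines, assuming equal-length transcriptions and position-by-position comparison (a simplification of real double-entry systems, which also handle insertions and deletions):

```python
from collections import Counter

def reconcile(entries):
    """Multi-entry reconciliation sketch: for each character position,
    keep the value if a strict majority of typists agree, otherwise
    flag it as None for manual review."""
    result = []
    for i, chars in enumerate(zip(*entries)):
        (best, votes), = Counter(chars).most_common(1)
        result.append((i, best if votes > len(entries) // 2 else None))
    return result
```

With three typists, any single mistyped character is outvoted; with two, every disagreement gets flagged for a human to resolve.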
Unlike every other PDF alternative that has been attempted, the federal government doesn't have to worry about adoption.
DjVu [1] would be another option. It has really good open source tooling available, but it supports substantially fewer features than PDF, making it not really suitable as a drop-in replacement. The format is relatively simple though, so redaction should be fairly doable.
TIFF [2] is already occasionally used for government documents, but it's arguably more complex than PDF, so probably not a good choice for this.
It’s not a tools problem, it’s a problem of malicious compliance and contempt for the law.
For example, when the Mueller reports were released with redactions, they had no searchable text or metadata because of worries about exactly this kind of data leak.
However, vast troves of unsearchable text are not a huge win for transparency.
PDFs are just a garbage format and even good administrations struggle.
The copy linked in the post:
https://www.justice.gov/epstein/files/DataSet%209/EFTA004004...
Three more copies:
https://www.justice.gov/epstein/files/DataSet%2010/EFTA02153...
https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...
https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...
Having several different versions might make it easier.
https://www.justice.gov/epstein/files/DataSet%209/EFTA007755...
This doesn't solve the "1 & l" problem for the pdf you are looking at, but it could be useful anyway.
https://www.justice.gov/epstein/files/DataSet%2011/EFTA02702...
I'm lucky to have parents with strong values. My whole life they've given me advice, on the small stuff and the big decisions. I didn't always want to hear it when I was younger, but now in my late thirties, I'm really glad they kept sharing it. In hindsight I can see the life experience and wisdom in it, and how it's helped and shaped me.
Hmm. Anyone got some spare CPU time?
Someone who made some progress on one Base64 attachment got some XMP metadata that suggested a photo from an iPhone. Now I don't know if that photo was itself embedded in a PDF, but perhaps getting at least the first few hundred bytes decoded (even if it had to be done manually) would hint at the file-type of the attachment. Then you could run your tests for file fidelity.
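Sniffing the file type from a short recovered prefix is straightforward, since most container formats start with well-known magic bytes. A small sketch (the signature table below uses the commonly published magic numbers; extend as needed):

```python
import base64

# Common file signatures (magic numbers) for sniffing a recovered prefix.
MAGIC = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF-": "PDF",
    b"PK\x03\x04": "ZIP (also docx/xlsx)",
}

def sniff(b64_prefix: str) -> str:
    """Decode the first few base64 chars and match known signatures."""
    # Truncate to a multiple of 4 so b64decode accepts the fragment.
    data = base64.b64decode(b64_prefix[: len(b64_prefix) // 4 * 4])
    for sig, name in MAGIC.items():
        if data.startswith(sig):
            return name
    return "unknown"
```

An iPhone photo with XMP metadata would most likely show up as JPEG here, so even a few hundred clean leading characters should settle the question.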
That's pointed out in the article. It's easy for plaintext sections, but not for compressed sections. I didn't notice any mention of checksums.
Followup: pdfimages is 13x faster than pdftoppm
Or worse. She did.
More likely it's just an oversight, but it could also be CYA for dragging their feet, like "you rushed us, and look at these victims you've retraumatized". There are software solutions to find nudity and they're quite effective.
Page 1: https://imgur.com/a/jwgu9uH
Page 2: https://imgur.com/a/4Zi3bkk
Use this: https://github.com/KoKuToru/extract_attachment_EFTA00400459
The recipient is also named in there...
The search on the DOJ website (which we shouldn't trust), given the query "Content-Type: application/pdf; name=", yields maybe a half dozen or so similarly printed Base64 attachments.
There's probably lots of images as well attached in the same way (probably mostly junk). I deleted all my archived copies recently once I learned about how not-quite-redacted they were. I will leave that exercise to someone else.
I had a reasonably simple problem to solve: a slightly weird font and some ten words of English (I was actually only missing one or two blocks of letters to cover everything I needed).
After a couple of days, having almost everything (?), I just surrendered. This seems to be intentionally hostile: all the docs scattered across several repositories, no comprehensive examples, etc.
Absolutely awful piece of software from this end (training the last gen).
I tried to find the message in this blog post, but couldn't (I don't see how to search by date).
Incompetence is incompetence.
It's really really hard to give them the benefit of the doubt at this point.
They wasted months erasing Trump from that instead. So it's on them.
A dynamic programming type approach might still be helpful. One reading of the character might produce invalid flate data while the other stays valid, or might give an implausible result.
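Both signals can be checked cheaply. A minimal sketch, assuming the candidate bytes are a zlib stream and that the payload is expected to be mostly text (the printable-byte heuristic is an assumption, not something from the article):

```python
import zlib

def score_candidate(candidate: bytes):
    """Validity probe for one 1-vs-l resolution (sketch): returns None if
    the flate data is rejected outright, otherwise the fraction of
    printable bytes in the decompressed output, as a rough plausibility
    score for text payloads."""
    try:
        out = zlib.decompressobj().decompress(candidate)
    except zlib.error:
        return None  # invalid deflate data: this branch can be discarded
    if not out:
        return 0.0
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in out)
    return printable / len(out)
```

A hard `None` rejects a branch outright; among surviving branches, the higher-scoring one is the more plausible reading.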
https://www.justice.gov/epstein/files/DataSet%2010/EFTA01804...
https://www.justice.gov/epstein/files/DataSet%209/EFTA007755...
https://www.justice.gov/epstein/files/DataSet%209/EFTA004349...
and then this one, judging by the name of the file (Hanna something) and the content of the email:
"Here is my girl, sweet sparkling Hanna=E2=80=A6! I am sure she is on Skype "
Maybe more sinister (so be careful; I have no idea what the laws are if you uncover you-know-what Trump and Epstein were into)...
https://www.justice.gov/epstein/files/DataSet%2011/EFTA02715...
[Above is probably a legit modeling CV for HANNA BOUVENG, based on, https://www.justice.gov/epstein/files/DataSet%209/EFTA011204..., but still creepy, and doesn't seem like there's evidence of her being a victim]
I tried and got a lot of errors; can't seem to fix it, due to corruption.
https://www.docfly.com/editor/fa3bcb1fa9e8d2629b32/v9r21qsju...
Tried to get AI to guess the remaining text: https://pastebin.com/Z9X2d510
Anyway searching for the email sender's name, there's a screenshot of an email of hers in English offering him a girl as an assistant who is "in top physical shape" (probably not this Hanna girl). That's fucking creepy: https://www.expressen.se/nyheter/varlden/epsteins-lofte-till...
Wonder why there are so many random case files in the files.
Cool article, however.
PDF is basically a prettified layer on top of the older PostScript that brings a lot of baggage. The moment you start trying to do what should be simple stuff, like editing lines, merging pages, or changing the resolution of images, it starts giving you a lot of headaches.
I used to have a few scripts around to fight some of its quirks from when I was writing my thesis and had to work daily with it. But well, it was still an improvement over Word.