Newton, G., A. Callahan & M. Dumontier. 2009. Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library. Second Workshop on Very Large Digital Libraries at the European Conference on Digital Libraries (ECDL) 2009. https://lekythos.library.ucy.ac.cy/bitstream/handle/10797/14...
I am the first author.
I can imagine mining all of these articles was a ton of work. I’d be curious to know how quickly the computation could be done today vs. the 13 hour 2009 benchmark :)
Nowadays people would be slamming those data through UMAP!
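For anyone curious, here is roughly what that looks like today with umap-learn (my own sketch, not the paper's pipeline; `X` stands in for whatever per-article vectors you have already computed):

```python
# My own sketch, not the paper's pipeline: project article vectors to 2-D with UMAP.
# `X` is a placeholder for whatever per-article vectors you have already computed.
import numpy as np
import umap  # pip install umap-learn

X = np.random.rand(1000, 300)  # placeholder article vectors
reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
coords = reducer.fit_transform(X)  # shape (1000, 2), ready to scatter-plot
```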
First author(s): the individual(s) who organized and conducted the study. Typically there is only a single first author, but nowadays there are often two first authors. This is because the amount of research required to generate “high impact” publications simply can’t be done by a single person. Typically, the first author is a Ph.D. student or lab scientist.
Middle authors: Individuals that provide critical effort, help, feedback, or guidance for the study and publication of the research. Different fields/labs have varying stringencies for what is considered “middle authorship worthy”. In many labs, simply being present and helping with the research warrants authorship. In other labs, you need to contribute a lot of energy to the project to be included as an author.
Senior author(s): The principal investigators (PIs) or lead researchers that run the lab that conducted and published the study. The senior authors are typically the ones who acquire funding and oversee all aspects of the published research. PIs have varying degrees of hands-on management.
There is some variation in whether the central research question for a manuscript is developed by the first vs. the senior author, but usually it's the senior author. Also, the first and senior authors typically write the manuscript and seek edits/feedback from middle authors. In other cases, there are dedicated writers who draft the manuscript and may or may not get middle authorship. A main takeaway is: the general outline I've provided above is not strictly adhered to.
I’ll take some liberty to apply this outline to this article at hand:
First Author: G. Newton (OP). The scientist who most likely conducted all of the data mining and analysis. He likely wrote the article as well.
Middle Author: A. Callahan. It seems like this author was a grad student at the time the article was written. She likely performed essential work for the paper’s publication. This could’ve been: helping with the analysis, data mining, or ideation.
Senior Author: M. Dumontier. A data science professor, now at Maastricht U. He’s a highly cited scientist!
Lastly… if you check out the acknowledgements, you can see three additional names. These people likely helped with setting up compute access, editing, or general ideation.
This is a cool manuscript! Hopefully this overview isn’t TMI and provides some insight into the biomedical/data science publication process.
This post is a good example of why going straight to LLM embeddings for NLP is a pragmatic first step, especially for long documents.
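For example, a minimal sketch of that first step, assuming sentence-transformers (the model name is just a common default, not something from the post):

```python
# Rough sketch of "go straight to embeddings"; model name is a common default,
# not something taken from the post.
from sentence_transformers import SentenceTransformer

docs = ["first long document ...", "second long document ..."]
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs, normalize_embeddings=True)
print(embeddings.shape)  # (len(docs), embedding_dim)
```

One caveat: most embedding models truncate long inputs, so for genuinely long documents you usually chunk first and pool the chunk embeddings.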
I dug through the code and there's a ton of stuff I'm not familiar with, probably techniques I don't know rather than the Python itself, ofc.
The main difference was that sites like Chegg and many other sites started slurping them up to resell in some way.
I have a decent collection of ebooks/pdfs/manga from reading. But I can’t imagine how large a 20TB library is.
I can't speak for the OP, but you can buy optical media of old out-of-print magazines scanned as PDFs.
I bought the entirety of Desert Magazine from 1937-1985. It arrived on something like 15 CD-ROMs.
I drag-and-dropped the entire collection into iBooks, and read them when I'm on the train.
(Yes, they're probably on archive.org for free, but this is far easier and more convenient, and I prefer to support publishers rather than undermine their efforts.)
There is a significant issue with copyright, though. I'll remove anything with a valid DMCA request, but 99.9% of the world's historical magazine issues are now in IP limbo as their ownership is probably unknown. Most of the other 0.1% aren't overly concerned, as distribution is their goal and their main income is advertising, not sales.
Give the Aryn partitioning service a shot: https://www.aryn.ai/post/announcing-the-aryn-partitioning-se...
We recently released it, and we have a few examples here: https://sycamore.readthedocs.io/en/stable/aryn_cloud/get_sta... that show you how to turn the tabular data from the PDF into a pandas DataFrame (which you can then turn into CSV).
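For what it's worth, the very last step is plain pandas. A tiny sketch, assuming the partitioning step already handed you the table cells as rows (the values below are made up):

```python
# Plain pandas for the final step only (not the Aryn/Sycamore API itself);
# assumes the partitioning service already returned the table cells as rows.
import pandas as pd

rows = [
    {"year": 2021, "pdf_count": 100},  # made-up example values
    {"year": 2022, "pdf_count": 120},
]
df = pd.DataFrame(rows)
df.to_csv("extracted_table.csv", index=False)
```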
I wanted to make a bit of an open source tool to pull down useful time series data for the social sciences (e.g. time series of social media comments about grocery prices). Seems like LLMs have unlocked all kinds of new research angles that people aren't using yet.
I may steal some of your good ideas if I ever get to work on that side project :)
Curious on your prompt: https://github.com/snat-s/m/blob/main/classify_metadata/prom...
Wouldn't this be basically prompting to classify by the type of URL?
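Something like this hypothetical sketch, I'd guess (placeholder model and categories, not the author's actual prompt; the linked prompt.py has the real one):

```python
# Hypothetical sketch of URL-only classification via an LLM prompt.
# Model name and categories are placeholders, not the author's actual setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
url = "https://example.edu/~prof/teaching/cs101/lecture-notes.pdf"
prompt = (
    "Using only its URL, classify this PDF into one of: "
    "education, legal, finance, science, other.\n"
    f"URL: {url}\nAnswer with the category only."
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```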
Definitely as 'hot dog' or 'not a hot dog'.
(I'm a bit disappointed that most of the discussion is about estimating the size of PDFs on the internet, I'd love to hear more about different approaches to extracting better data from the PDFs.)
Full disclosure: I'm an employee
Ordering of statements.
1. (Title) Classifying all of the pdfs on the internet
2. (First Paragraph) Well not all, but all the PDFs in Common Crawl
3. (First Image) Well not all of them, but 500k of them.
I am not knocking the project, but while categorizing 500k PDFs is something we couldn't necessarily do well a few years ago, this is far from "the internet's PDFs".
For those of us who aren't familiar with this random acronym, I think RTBF = right to be forgotten.
Only EU bureaucrats would have the hubris to believe you could actually, comprehensively remove information from the Internet. Once something is spread, it is there forever.
RTBF is about getting companies to get rid of any trace of you so they cannot use that data, not about removing all traces of you across the internet.
Really depends on the content. Tons of websites go down every day; link rot is a real thing. The Internet Archive and individual people don't save nearly everything.
Something I should do more often is saving MHTML copies of webpages I find interesting.
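In case it helps, this is one way I'd script that with Selenium and headless Chrome; Chrome's DevTools `Page.captureSnapshot` command returns MHTML (the URL and filename are placeholders):

```python
# Rough sketch: save a page as MHTML with headless Chrome via Selenium's CDP bridge.
# URL and output filename are placeholders.
from selenium import webdriver

opts = webdriver.ChromeOptions()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
driver.get("https://example.com/interesting-article")
snapshot = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})
with open("interesting-article.mhtml", "w", encoding="utf-8") as f:
    f.write(snapshot["data"])
driver.quit()
```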
It never meant that you have the right to ask "the Internet" as a whole to scrub you from all possible records, that's indeed ludicrous. And if someone took it to mean that and they were pushing for it, they were just confused, no serious law ever proposed that.
Right to be forgotten, not the Belgian public service broadcaster (https://en.wikipedia.org/wiki/RTBF)?
If you like, you could say PDFs are information dense but data sparse. After all, they're mostly white space ;)
So indeed, not representative of the whole Internet.
>Specifically, when Common Crawl gets to a pdf, it just stores the first megabyte of information and truncates the rest.
This is where SafeDocs, a.k.a. CC-MAIN-2021-31-PDF-UNTRUNCATED, enters the picture. The corpus was originally created by the DARPA SafeDocs program, which refetched the PDFs from a Common Crawl snapshot to obtain untruncated versions of them.
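As a rough illustration of what the truncation looks like in a regular crawl, here's a sketch using warcio to flag truncated PDF responses (the filename is a placeholder; the untruncated corpus exists precisely so you don't have to do this yourself):

```python
# Sketch: flag truncated PDF responses in a Common Crawl WARC file with warcio.
# Filename is a placeholder; illustrative only.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        ctype = record.http_headers.get_header("Content-Type") or ""
        if "application/pdf" in ctype:
            truncated = record.rec_headers.get_header("WARC-Truncated")
            print(record.rec_headers.get_header("WARC-Target-URI"), truncated)
```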
[0] https://link.springer.com/article/10.1007/s11192-015-1614-6
(Although you could argue libgen is not really "public" in the legal sense of the word, lol).
Disregarding that, the article is great!
(edit: why would someone downvote this, HN is becoming quite hostile lately)
They all probably contain lots of duplicates but...
33TB (first Google result, from 5 years ago), not 33GB. More recent sources give larger figures.
Also, there are browser extensions that will automatically downvote and/or hide HN comments that use words like "lol," or start with "So..." or include any of a number of words that the user considers indicative of low-grade content.
I don't have 8TB lying around, but we can be a bit more clever.... In particular, I cared about a specific column called url. I really care about the urls because they essentially tell us a lot more about a website than meets the eye.
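To make that concrete, here's a sketch of the column-only read (my own illustration, not the author's code; the local filename stands in for a piece of Common Crawl's columnar index):

```python
# Sketch of the "be more clever" idea: read only the `url` column instead of
# pulling whole files. The local filename is a placeholder for one shard of
# Common Crawl's columnar (parquet) index.
import pyarrow.parquet as pq

table = pq.read_table("cc-index-part-00000.parquet", columns=["url"])
urls = table.column("url").to_pylist()
print(len(urls), urls[:3])
```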
Am I correct that it is only using the URL of the PDF to do classification? Maybe still useful, but that's quite a different story than "classifying all the pdfs". The legwork to classify PDFs is already done, and the authorship of the article can go to anyone who can get a grant for a $400 NewEgg order for an 8TB drive.