Newton, G., A. Callahan & M. Dumontier. 2009. Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library. Second Workshop on Very Large Digital Libraries at the European Conference on Digital Libraries (ECDL) 2009. https://lekythos.library.ucy.ac.cy/bitstream/handle/10797/14...
I am the first author.
I can imagine mining all of these articles was a ton of work. I’d be curious to know how quickly the computation could be done today vs. the 13 hour 2009 benchmark :)
Nowadays people would be slamming those data through UMAP!
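For anyone curious, here is roughly what that looks like today with umap-learn (my own sketch, not the paper's pipeline; `X` stands in for whatever per-article vectors you have already computed):

```python
# My own sketch, not the paper's pipeline: project article vectors to 2-D with UMAP.
# `X` is a placeholder for whatever per-article vectors you have already computed.
import numpy as np
import umap  # pip install umap-learn

X = np.random.rand(1000, 300)  # placeholder article vectors
reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
coords = reducer.fit_transform(X)  # shape (1000, 2), ready to scatter-plot
```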
First author(s): the individual(s) who organized and conducted the study. Typically there is only a single first author, but nowadays there are often two first authors. This is because the amount of research required to generate “high impact” publications simply can’t be done by a single person. Typically, the first author is a Ph.D. student or lab scientist.
Middle authors: Individuals that provide critical effort, help, feedback, or guidance for the study and publication of the research. Different fields/labs have varying stringencies for what is considered “middle authorship worthy”. In many labs, simply being present and helping with the research warrants authorship. In other labs, you need to contribute a lot of energy to the project to be included as an author.
Senior author(s): The principal investigators (PIs) or lead researchers that run the lab that conducted and published the study. The senior authors are typically the ones who acquire funding and oversee all aspects of the published research. PIs have varying degrees of hands-on management.
There is some variation in whether the central research question for a manuscript is developed by the first vs. the senior author, but usually it's the senior author. Also, the first and senior authors typically write the manuscript and seek edits/feedback from middle authors. In other cases, there are dedicated writers who draft the manuscript and may or may not get middle authorship. A main takeaway is: the general outline I've provided above is not strictly adhered to.
I’ll take some liberty to apply this outline to this article at hand:
First Author: G. Newton (OP). The scientist who most likely conducted all of the data mining and analysis. He likely wrote the article as well.
Middle Author: A. Callahan. It seems like this author was a grad student at the time the article was written. She likely performed essential work for the paper’s publication. This could’ve been: helping with the analysis, data mining, or ideation.
Senior Author: M. Dumontier. A data science professor, now at Maastricht U. He’s a highly cited scientist!
Lastly… if you check out the acknowledgements, you can see three additional names. These people likely helped with setting up compute access, editing, or general ideation.
This is a cool manuscript! Hopefully this overview isn’t TMI and provides some insight into the biomedical/data science publication process.
This post is a good example of why going straight to LLM embeddings for NLP is a pragmatic first step, especially for long documents.
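For example, a minimal sketch of that first step, assuming sentence-transformers (the model name is just a common default, not something from the post):

```python
# Rough sketch of "go straight to embeddings"; model name is a common default,
# not something taken from the post.
from sentence_transformers import SentenceTransformer

docs = ["first long document ...", "second long document ..."]
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs, normalize_embeddings=True)
print(embeddings.shape)  # (len(docs), embedding_dim)
```

One caveat: most embedding models truncate long inputs, so for genuinely long documents you usually chunk first and pool the chunk embeddings.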
I dug through the code and there's a ton of stuff I'm not familiar with, probably techniques I don't know rather than the Python itself, ofc.
The main difference was that sites like Chegg and many other sites started slurping them up to resell in some way.
I have a decent collection of ebooks/pdfs/manga from reading. But I can’t imagine how large a 20TB library is.
I can't speak for the OP, but you can buy optical media of old out-of-print magazines scanned as PDFs.
I bought the entirety of Desert Magazine from 1937-1985. It arrived on something like 15 CD-ROMs.
I drag-and-dropped the entire collection into iBooks, and read them when I'm on the train.
(Yes, they're probably on archive.org for free, but this is far easier and more convenient, and I prefer to support publishers rather than undermine their efforts.)
There is a significant issue with copyright, though. I'll remove anything with a valid DMCA request, but 99.9% of the world's historical magazine issues are now in IP limbo as their ownership is probably unknown. Most of the other 0.1% aren't overly concerned, as distribution is their goal and their main income is advertising, not sales.
Give the Aryn partitioning service a shot: https://www.aryn.ai/post/announcing-the-aryn-partitioning-se...
We recently released it, and we have a few examples here: https://sycamore.readthedocs.io/en/stable/aryn_cloud/get_sta... that show you how to turn the tabular data from the PDF into a pandas DataFrame (which you can then turn into CSV).
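For what it's worth, the very last step is plain pandas. A tiny sketch, assuming the partitioning step already handed you the table cells as rows (the values below are made up):

```python
# Plain pandas for the final step only (not the Aryn/Sycamore API itself);
# assumes the partitioning service already returned the table cells as rows.
import pandas as pd

rows = [
    {"year": 2021, "pdf_count": 100},  # made-up example values
    {"year": 2022, "pdf_count": 120},
]
df = pd.DataFrame(rows)
df.to_csv("extracted_table.csv", index=False)
```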
I wanted to make a bit of an open source tool to pull down useful time series data for the social sciences (e.g. time series of social media comments about grocery prices). Seems like LLMs have unlocked all kinds of new research angles that people aren't using yet.
I may steal some of your good ideas if I ever get to work on that side project :)
Curious on your prompt: https://github.com/snat-s/m/blob/main/classify_metadata/prom...
Wouldn't this be basically prompting to classify by the type of URL?
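Something like this hypothetical sketch, I'd guess (placeholder model and categories, not the author's actual prompt; the linked prompt.py has the real one):

```python
# Hypothetical sketch of URL-only classification via an LLM prompt.
# Model name and categories are placeholders, not the author's actual setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
url = "https://example.edu/~prof/teaching/cs101/lecture-notes.pdf"
prompt = (
    "Using only its URL, classify this PDF into one of: "
    "education, legal, finance, science, other.\n"
    f"URL: {url}\nAnswer with the category only."
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```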
Definitely as 'hot dog' or 'not a hot dog'.
(I'm a bit disappointed that most of the discussion is about estimating the size of PDFs on the internet, I'd love to hear more about different approaches to extracting better data from the PDFs.)
Full disclosure: I'm an employee
Ordering of statements.
1. (Title) Classifying all of the pdfs on the internet
2. (First Paragraph) Well not all, but all the PDFs in Common Crawl
3. (First Image) Well not all of them, but 500k of them.
I am not knocking the project, but while categorizing 500k PDFs is something we couldn't necessarily do well a few years ago, this is far from "the internet's PDFs".
For those of us who aren't familiar with this random acronym, I think RTBF = right to be forgotten.
Only EU bureaucrats would have the hubris to believe you could actually, comprehensively remove information from the Internet. Once something is spread, it is there forever.
RTBF is about getting companies to get rid of any trace of you so they cannot use that data, not about removing all traces of you across the internet.
Really depends on the content. Tons of websites go down every day; link rot is a real thing. The Internet Archive and individual people don't save nearly everything.
Something I should do more often is saving MHTML copies of webpages I find interesting.
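In case it helps, this is one way I'd script that with Selenium and headless Chrome; Chrome's DevTools `Page.captureSnapshot` command returns MHTML (the URL and filename are placeholders):

```python
# Rough sketch: save a page as MHTML with headless Chrome via Selenium's CDP bridge.
# URL and output filename are placeholders.
from selenium import webdriver

opts = webdriver.ChromeOptions()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
driver.get("https://example.com/interesting-article")
snapshot = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})
with open("interesting-article.mhtml", "w", encoding="utf-8") as f:
    f.write(snapshot["data"])
driver.quit()
```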
It never meant that you have the right to ask "the Internet" as a whole to scrub you from all possible records, that's indeed ludicrous. And if someone took it to mean that and they were pushing for it, they were just confused, no serious law ever proposed that.
Right to be forgotten, not the Belgian public service broadcaster (https://en.wikipedia.org/wiki/RTBF)?
If you like, you could say PDFs are information dense but data sparse. After all, they're mostly white space ;)
So indeed, not representative of the whole Internet.
>Specifically, when Common Crawl gets to a pdf, it just stores the first megabyte of information and truncates the rest.
This is where SafeDocs, a.k.a. CC-MAIN-2021-31-PDF-UNTRUNCATED, enters the picture. The corpus was originally created by the DARPA SafeDocs program, which refetched the PDFs from a Common Crawl snapshot to obtain untruncated versions of them.
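As a rough illustration of what the truncation looks like in a regular crawl, here's a sketch using warcio to flag truncated PDF responses (the filename is a placeholder; the untruncated corpus exists precisely so you don't have to do this yourself):

```python
# Sketch: flag truncated PDF responses in a Common Crawl WARC file with warcio.
# Filename is a placeholder; illustrative only.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        ctype = record.http_headers.get_header("Content-Type") or ""
        if "application/pdf" in ctype:
            truncated = record.rec_headers.get_header("WARC-Truncated")
            print(record.rec_headers.get_header("WARC-Target-URI"), truncated)
```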
[0] https://link.springer.com/article/10.1007/s11192-015-1614-6
(Although you could argue libgen is not really "public" in the legal sense of the word, lol).
Disregarding that, the article is great!
(edit: why would someone downvote this, HN is becoming quite hostile lately)
They all probably contain lots of duplicates but...
33TB (first Google result, from 5 years ago), not 33GB. More recent sources give larger figures.
Also, there are browser extensions that will automatically downvote and/or hide HN comments that use words like "lol," or start with "So..." or include any of a number of words that the user considers indicative of low-grade content.
I don't have 8TB lying around, but we can be a bit more clever.... In particular, I cared about a specific column called url. I really care about the urls because they essentially tell us a lot more about a website than meets the eye.
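To make that concrete, here's a sketch of the column-only read (my own illustration, not the author's code; the local filename stands in for a piece of Common Crawl's columnar index):

```python
# Sketch of the "be more clever" idea: read only the `url` column instead of
# pulling whole files. The local filename is a placeholder for one shard of
# Common Crawl's columnar (parquet) index.
import pyarrow.parquet as pq

table = pq.read_table("cc-index-part-00000.parquet", columns=["url"])
urls = table.column("url").to_pylist()
print(len(urls), urls[:3])
```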
Am I correct that it is only using the URL of the PDF to do classification? Maybe still useful, but that's quite a different story than "classifying all the pdfs". The legwork to classify PDFs is already done, and the authorship of the article can go to anyone who can get a grant for a $400 NewEgg order for an 8TB drive.