> Except as expressly authorized by Y Combinator, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Site or the Site Content, in whole or in part
Not that this isn't already happening widely behind the curtain, but doing it openly as a "Show HN" seems daring.
Try to have your account and its contents deleted. The best I was offered for my 2011-vintage account was to randomize the username, and the reason I was given was that browsing an old thread with a bunch of deleted comments "looks bad".
So, you could additionally give a license to the world to use your posted comments freely. That doesn't mean HN can't add terms to say clients can't copy the site as a condition for use.
https://news.ycombinator.com/item?id=46435308
https://github.com/DOSAYGO-STUDIO/HackerBook
The mods and community had no problem with it
Differences: sharded SQLite, uses the BigQuery export, the build script is open on GitHub, there's an interactive "archived website" view of HN, and it's updated weekly (each build costs a couple of dollars on a custom GitHub runner)
I built my own pipeline with a slightly different setup. I use Go to download and process the data, and update it every 5 minutes using the HN API, trying to stay within fair use. It is also easy to tweak if someone wants faster or slower updates.
One part I really like is the "dynamic" README on Hugging Face. It is generated automatically by the code and keeps updating as new commits come in, so you can just open it and quickly see the current state.
The code is still a bit messy right now (I open sourced it together with around 3.6M lines across 100+ other tools, hidden in a corner of GitHub; anyone interested can play Sherlock Holmes and find it :) ), but I will clean it up, open source it as a cleaner standalone repository, and write a proper blog post explaining how it works.
There have been tons of alternative frontends and projects using HN data over the years, posted to Show HN without an issue. I think their primary concern is interference with the Y Combinator brand itself, with "the Site" and "Site Content" referring to Y Combinator's own site rather than HN specifically.
https://console.cloud.google.com/marketplace/details/y-combi...
The main reason I built this was to have HN data that is easy to query and always up to date, without needing to run your own pipeline first. There are also some interesting ideas in the pipeline, like what I call "auto-heal". Happy to share more if anyone is interested :)
A lot of the choices are trade-offs, as usual with data pipelines. I chose Parquet because it is columnar and compressed, so tools like DuckDB or Polars can read only the columns they need. This matters a lot as the dataset grows. I went with Hugging Face mainly because it is simple and already handles distribution and versioning. I can just push data as commits and get a built-in history without managing extra infrastructure (and, more conveniently, if you read the README, you can query it directly using Python or DuckDB).
The pipeline is incremental. Instead of rebuilding everything, it appends small batches every few minutes using the API. That keeps it fresh while staying cheap to run. The data is also partitioned by time, so queries do not need to scan the entire dataset (and I use very simple tech, just a Go binary running in a "screen" session, using only a few MB of RAM for the whole pipeline).
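The incremental idea can be sketched roughly like this (function names, the partition layout, and the batch size are my guesses for illustration, not the actual Go pipeline):

```python
import datetime

def partition_path(item: dict) -> str:
    # Partition by month of the item's creation time so queries can
    # skip files outside their time range (layout is illustrative).
    ts = datetime.datetime.fromtimestamp(item["time"], tz=datetime.timezone.utc)
    return f"items/{ts.year:04d}-{ts.month:02d}.parquet"

def next_batch(last_seen: int, max_id: int, batch_size: int = 3) -> list[int]:
    # Fetch only ids we have not processed yet, in small batches,
    # instead of rebuilding the whole dataset each run.
    return list(range(last_seen + 1, min(last_seen + batch_size, max_id) + 1))

print(partition_path({"time": 1160418111}))  # items/2006-10.parquet
print(next_batch(100, 1000))                 # [101, 102, 103]
```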
In my ongoing project, with 10 servers like this I could index a large part of the internet (about 10 billion pages) with vector and full-text search.
I've been evaluating Gemini Embedding 2 using Hacker News comments and I wasted half a day making a wrapper for the HN API to collect some sample data to play with.
In case anyone is curious:
- The ability to simply truncate the provided embedding to a prefix (and then renormalize) is useful because it lets users re-use the same (paid!) embedding API response for multiple indexes at different qualities.
- Traditional enterprise software vendors are struggling to keep up with the pace of AI development. Microsoft SQL Server for example can't store a 3072 element vector with 32-bit floats (because that would be 12 KB and the page size is only 8 KB). It supports bfloat16 but... the SQL client doesn't! Or Entity Framework. Or anything else.
- Holy cow everything is so slow compared to full text search! The model is deployed in only one US region, so from Australia the turnaround time is something like 900 milliseconds. Then the vector search over just a few thousand entries with DiskANN is another 600-800 ms! I guess search-as-you-type is out of the question for... a while.
- Speaking of slow, the first thing I had to do was write an asynchronous parallel bounded queue data processor utility class in C# that supports chunking of the input and rate limit retries. This feels like it ought to be baked into the standard library or at least the AI SDKs because it's pretty much mandatory if working with anything other than "hello world" scenarios.
- Gemini Embedding 2 has the headline feature of multi-modal input, but they forgot to implement anything other than "string" for their IEmbeddingGenerator abstraction when used with Microsoft libraries. I guess the next "Preview v0.0.3-alpha" version or whatever will include it.
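The truncate-and-renormalize trick from the first bullet is simple to sketch (pure Python, no real embedding API involved):

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    # Keep only the first `dim` components of the full embedding,
    # then renormalize to unit length so cosine similarity still works.
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

full = [3.0, 4.0, 0.0, 0.0]          # stand-in for a 3072-dim response
small = truncate_embedding(full, 2)  # reuse the same paid response
print(small)  # [0.6, 0.8]
```

One paid API call can feed several indexes at different dimensionalities this way, trading recall for storage and speed per index.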
  embeddings = []
  for text in texts:
      key = (text, model)
      if key not in pickle_cache:  # only call the paid API for uncached texts
          pickle_cache[key] = openai_client.create_embedding(text, model=model)
      embeddings.append(pickle_cache[key])
  operations.save_pickle_cache(pickle_cache, pickle_path)
  return embeddings
At the throughput rates I was seeing of one embedding per second, a million comments would take over a week to process! I had to call the Gemini model with ten comments at a time from eight threads to reach even the paltry 3K rpm rate limit they offer to "Tier 1" customers.
Based on this experience, for real "enterprise" customers I might implement a generic wrapper for Google's Batch API that could handle continuous streaming from a database, chunking it, uploading, and then in parallel checking the status of the pending jobs and streaming the results back into a database.
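The chunked, parallel, retry-on-rate-limit pattern described above can be sketched like this (in Python rather than C#, with the embedding endpoint stubbed out; names are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    # Split the input into fixed-size batches (e.g. ten comments per call).
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_batch(batch, call_api, retries=3):
    # `call_api` stands in for the real embedding endpoint; back off and
    # retry when it signals a rate limit.
    for attempt in range(retries):
        try:
            return call_api(batch)
        except RuntimeError:  # stand-in for an HTTP 429 error
            time.sleep(2 ** attempt * 0.01)
    raise RuntimeError("rate limited too many times")

def embed_all(texts, call_api, batch_size=10, workers=8):
    # Bounded parallelism: at most `workers` in-flight calls at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda b: embed_batch(b, call_api),
                           chunked(texts, batch_size))
    return [vec for batch in results for vec in batch]

# Stub that "embeds" each text as its length.
print(embed_all(["a", "bb", "ccc"],
                lambda batch: [len(t) for t in batch],
                batch_size=2))  # [1, 2, 3]
```

`ThreadPoolExecutor.map` preserves input order, so results line up with the source rows even though batches complete out of order.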
deleted and dead are integers. They are stored as 0/1 rather than booleans.
Is there a technical reason to do this? You have the type right there. "Deleted" and "dead" are separate columns.
> So not only would it not make sense semantically, it would break if a third means were introduced.
If that was the intention, it seems like a bad design decision to me. In fact, what you assume the reasoning to be is exactly what should be avoided, which makes it a bad choice.
This is a limitation not because the bool value is represented by an int (or rather "presented as" one), but because of the type itself being an integer.
Your project actually helps me out a ton with one of my Hacker News project ideas that I had put on the back burner.
I had thought of making a ping service where people can just write @username, and a service detects it and emails that user if they have signed up (similar to a service run by someone in the HN community that emails you every time someone replies to your comment directly, but this time as a sort of ping)
[The idea came when I tried to ping someone to show them something relevant and thought: wait a minute, a ping-that-emails service might be interesting. I looked at whether I could hook it up with Algolia or another service, but nothing quite fit back then, so the idea stayed in the back of my mind. This project sort of solves it by being updated every 5 minutes]
Your 5-minute updates really make it possible. I will see what I can do with it in a few days, but I am seeing a discrepancy: the last update in the README seems to be March 16, so I would love to know whether it is really updated every 5 minutes. If true, it is phenomenal, and exciting to think of the new possibilities it unlocks.
Wouldn't that lose deleted/moderated comments?
There is also flexibility in what you define as the dataset. Skinnier but more focused tables could save space versus one wide table that covers everything, which will probably break up compressible runs of data.
The bigger concern is how large the git history is going to get on the repository.
This is likely to be lower traffic, and the history should (?) scale only linearly with new data, so likely not the worst thing. But it's something to be cognizant of when using SCM software in unexpected ways!
I have a similar project right now where I am scraping a dataset that is only ever offering the current state. I am trying to preserve the history of this dataset and was thinking of using the same strategy. If anyone has experience or pointers in how to best add time as a dimension to an existing generic dataset, I'd love to read about it.
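One common approach (my suggestion, not something from this thread) is append-only snapshots: stamp every scraped row with an observed_at time, never overwrite, and derive "current state" as latest-per-key. A minimal sketch:

```python
def latest_per_key(rows):
    # rows: (key, observed_at, value) tuples appended on every scrape.
    # Full history is preserved; current state is the newest row per key.
    current = {}
    for key, observed_at, value in sorted(rows, key=lambda r: r[1]):
        current[key] = value
    return current

history = [
    ("item1", "2026-03-01", "draft"),
    ("item1", "2026-03-10", "published"),
    ("item2", "2026-03-05", "draft"),
]
print(latest_per_key(history))
# {'item1': 'published', 'item2': 'draft'}
```

To save space you can additionally skip appending a row when the value is unchanged from the previous snapshot, at the cost of a slightly more involved "state as of time T" query.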
So it's not really one big file getting replaced all the time. Though a less extreme variation of that is happening day to day.
If you need fresher data, let me know. I will open source the whole pipeline later.
> The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.
That’s more than 5 minutes ago by a day or two. No big deal, but a little bit depressing this is still how we do things in 2026.
So to get all the data, you need to grab the archive plus all the 5-minute update files.
archive data is here https://huggingface.co/datasets/open-index/hacker-news/tree/...
update files are here (I know that it's called "today", but it actually includes all the update files, which span multiple days at this point) https://huggingface.co/datasets/open-index/hacker-news/tree/...
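Combining the archive with the update files is essentially a dedup by item id where the newest version wins. A sketch (field names and the idea that later files override earlier ones are assumptions on my part):

```python
def merge_items(archive, updates):
    # Later files override earlier ones for the same id, so edits,
    # deletions, and score changes in the update files take effect.
    merged = {item["id"]: item for item in archive}
    for item in updates:
        merged[item["id"]] = item
    return list(merged.values())

archive = [{"id": 1, "score": 10}, {"id": 2, "score": 5}]
updates = [{"id": 2, "score": 7}, {"id": 3, "score": 1}]
print(merge_items(archive, updates))
# id 2 now has score 7, and id 3 is appended
```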
They are suggesting that the Hugging Face description should automatically update the date and item count when the data gets updated.
Comments+posts are defined as user generated content, you have no right to its privacy/control in any capacity once you post it - https://www.ycombinator.com/legal/
YC in theory has the right to go after unauthorized 3rd parties scraping this data. But YC funds startups and is deeply invested in the AI space. Why on Earth would they do that?
Copyright doesn't seem to matter unless you're an IP cartel or mega cap.
https://www.ycombinator.com/legal/
Mods, enforce your license terms, you're playing fast and loose with the law (GDPR/CPRA)
The user content is supposed to be licensed only to Y Combinator and (bleah) its affiliated companies (which are many: all the startups they fund, for example).
Your submissions to, and comments you make on, the Hacker News site are not Personal Information and are not "HN Information" as defined in this Privacy Policy.
Other Users: certain actions you take may be visible to other users of the Services.
Then again, I'm not the guy that is going to get sued...
I agree. It's the owners of the sites that have to follow rules, not us.
And that, my friends, is how you kill the commons - by ignoring the social context surrounding its maintenance and insisting upon the most punitive ways of avoiding abuse.
I know, because I've been here since maybe 2015 or so, but this account was created in 2019.
So any PII you have mentioned in your comments is permanent on Hacker News.
I would appreciate it if they gave users the ability to remove all of their personal data, but in correspondence and in writing here on Hacker News itself, Dan has suggested that they value the posterity of conversations over the law.
https://www.ycombinator.com/legal/
See: User Content Transmitted Through the Site
That makes the implicit TOS agreement legally confusing depending on jurisdiction.
(Not that it really matters, but I find these technicalities amusing)