> Except as expressly authorized by Y Combinator, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Site or the Site Content, in whole or in part
Not that this isn't already happening widely behind the curtain, but doing it openly as a "Show HN" seems daring.
Try to have your account and its contents deleted. The best I was offered for my 2011-vintage account was to randomize the username, and the reason I was given was that browsing an old thread with a bunch of deleted comments "looks bad".
So, you could additionally give a license to the world to use your posted comments freely. That doesn't mean HN can't add terms to say clients can't copy the site as a condition for use.
https://news.ycombinator.com/item?id=46435308
https://github.com/DOSAYGO-STUDIO/HackerBook
The mods and community had no problem with it
Differences: sharded SQLite, uses the BigQuery export, the build script is open on GitHub, there's an interactive "archived website" view of HN, and it's updated weekly (each build costs a couple of dollars on a custom GitHub runner)
I built my own pipeline with a slightly different setup. I use Go to download and process the data, and update it every 5 minutes using the HN API, trying to stay within fair use. It is also easy to tweak if someone wants faster or slower updates.
One part I really like is the "dynamic" README on Hugging Face. It is generated automatically by the code and keeps updating as new commits come in, so you can just open it and quickly see the current state.
The code is still a bit messy right now (I open sourced it together with around 3.6M lines across 100+ other tools, hidden in a corner of GitHub; anyone interested can play Sherlock Holmes and find it :) ), but I will clean it up, open source it as a cleaner standalone repository, and write a proper blog post explaining how it works.
There have been tons of alternative frontends and projects using HN data over the years, posted to Show HN without an issue. I think their primary concern is interference with the Y Combinator brand itself, with "the Site" and "Site Content" referring to Y Combinator's own site rather than HN specifically.
https://console.cloud.google.com/marketplace/details/y-combi...
The main reason I built this was to have HN data that is easy to query and always up to date, without needing to run your own pipeline first. There are also some interesting ideas in the pipeline, like what I call "auto-heal". Happy to share more if anyone is interested :)
A lot of the choices are trade-offs, as usual with data pipelines. I chose Parquet because it is columnar and compressed, so tools like DuckDB or Polars can read only the columns they need. This matters a lot as the dataset grows. I went with Hugging Face mainly because it is simple and already handles distribution and versioning. I can just push data as commits and get a built-in history without managing extra infrastructure (and, more conveniently, if you read the README, you can query it directly using Python or DuckDB).
The pipeline is incremental. Instead of rebuilding everything, it appends small batches every few minutes using the API. That keeps it fresh while staying cheap to run. The data is also partitioned by time, so queries do not need to scan the entire dataset (and I use very simple tech, just a Go binary running in a "screen" session, using only a few MB of RAM for the whole pipeline).
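The incremental idea can be sketched roughly like this (function names, the partition layout, and the batch size are my guesses for illustration, not the actual Go pipeline):

```python
import datetime

def partition_path(item: dict) -> str:
    # Partition by month of the item's creation time so queries can
    # skip files outside their time range (layout is illustrative).
    ts = datetime.datetime.fromtimestamp(item["time"], tz=datetime.timezone.utc)
    return f"items/{ts.year:04d}-{ts.month:02d}.parquet"

def next_batch(last_seen: int, max_id: int, batch_size: int = 3) -> list[int]:
    # Fetch only ids we have not processed yet, in small batches,
    # instead of rebuilding the whole dataset each run.
    return list(range(last_seen + 1, min(last_seen + batch_size, max_id) + 1))

print(partition_path({"time": 1160418111}))  # items/2006-10.parquet
print(next_batch(100, 1000))                 # [101, 102, 103]
```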
In my ongoing project, with 10 servers like this I could index a large part of the internet (about 10 billion pages) with vector and full-text search.
I've been evaluating Gemini Embedding 2 using Hacker News comments and I wasted half a day making a wrapper for the HN API to collect some sample data to play with.
In case anyone is curious:
- The ability to simply truncate the provided embedding to a prefix (and then renormalize) is useful because it lets users re-use the same (paid!) embedding API response for multiple indexes at different qualities.
- Traditional enterprise software vendors are struggling to keep up with the pace of AI development. Microsoft SQL Server for example can't store a 3072 element vector with 32-bit floats (because that would be 12 KB and the page size is only 8 KB). It supports bfloat16 but... the SQL client doesn't! Or Entity Framework. Or anything else.
- Holy cow everything is so slow compared to full text search! The model is deployed in only one US region, so from Australia the turnaround time is something like 900 milliseconds. Then the vector search over just a few thousand entries with DiskANN is another 600-800 ms! I guess search-as-you-type is out of the question for... a while.
- Speaking of slow, the first thing I had to do was write an asynchronous parallel bounded queue data processor utility class in C# that supports chunking of the input and rate limit retries. This feels like it ought to be baked into the standard library or at least the AI SDKs because it's pretty much mandatory if working with anything other than "hello world" scenarios.
- Gemini Embedding 2 has the headline feature of multi-modal input, but they forgot to implement anything other than "string" for their IEmbeddingGenerator abstraction when used with Microsoft libraries. I guess the next "Preview v0.0.3-alpha" version or whatever will include it.
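The truncate-and-renormalize trick from the first bullet is simple to sketch (pure Python, no real embedding API involved):

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    # Keep only the first `dim` components of the full embedding,
    # then renormalize to unit length so cosine similarity still works.
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

full = [3.0, 4.0, 0.0, 0.0]          # stand-in for a 3072-dim response
small = truncate_embedding(full, 2)  # reuse the same paid response
print(small)  # [0.6, 0.8]
```

One paid API call can feed several indexes at different dimensionalities this way, trading recall for storage and speed per index.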
  embeddings = []
  for text in texts:
      key = (text, model)
      if key not in pickle_cache:  # only call the paid API for uncached texts
          pickle_cache[key] = openai_client.create_embedding(text, model=model)
      embeddings.append(pickle_cache[key])
  operations.save_pickle_cache(pickle_cache, pickle_path)
  return embeddings
At the throughput rates I was seeing of one embedding per second, a million comments would take over a week to process! I had to call the Gemini model with ten comments at a time from eight threads to reach even the paltry 3K rpm rate limit they offer to "Tier 1" customers.
Based on this experience, for real "enterprise" customers I might implement a generic wrapper for Google's Batch API that could handle continuous streaming from a database, chunking it, uploading, and then in parallel checking the status of the pending jobs and streaming the results back into a database.
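The chunked, parallel, retry-on-rate-limit pattern described above can be sketched like this (in Python rather than C#, with the embedding endpoint stubbed out; names are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    # Split the input into fixed-size batches (e.g. ten comments per call).
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_batch(batch, call_api, retries=3):
    # `call_api` stands in for the real embedding endpoint; back off and
    # retry when it signals a rate limit.
    for attempt in range(retries):
        try:
            return call_api(batch)
        except RuntimeError:  # stand-in for an HTTP 429 error
            time.sleep(2 ** attempt * 0.01)
    raise RuntimeError("rate limited too many times")

def embed_all(texts, call_api, batch_size=10, workers=8):
    # Bounded parallelism: at most `workers` in-flight calls at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda b: embed_batch(b, call_api),
                           chunked(texts, batch_size))
    return [vec for batch in results for vec in batch]

# Stub that "embeds" each text as its length.
print(embed_all(["a", "bb", "ccc"],
                lambda batch: [len(t) for t in batch],
                batch_size=2))  # [1, 2, 3]
```

`ThreadPoolExecutor.map` preserves input order, so results line up with the source rows even though batches complete out of order.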
deleted and dead are integers. They are stored as 0/1 rather than booleans.
Is there a technical reason to do this? You have the type right there. "Deleted" and "dead" are separate columns.
> So not only would it not make sense semantically, it would break if a third means were introduced.
If that was the intention, it seems like a bad design decision to me. In fact, what you assume the reasoning to be is exactly what should be avoided, which makes it a bad choice.
This is a limitation not because the bool value is represented by an int (or rather "presented as" one), but because of the type itself being an integer.
Your project actually helps me out a ton with one of my Hacker News project ideas that I had put on the back burner.
I had thought of making a ping service where people can just write @username, and a service detects it and emails that user if they have signed up (similar to a service run by someone in the HN community that emails you every time someone replies to your comment directly, but this time as a sort of ping)
[The idea came when I tried to ping someone to show them something relevant and thought: wait a minute, a ping-that-emails service might be interesting. I looked at whether I could hook it up with Algolia or another service, but nothing quite fit back then, so the idea stayed in the back of my mind. This project sort of solves it by being updated every 5 minutes]
Your 5-minute updates really make it possible. I will see what I can do with it in a few days, but I am seeing a discrepancy: the last update in the README seems to be March 16, so I would love to know whether it is really updated every 5 minutes. If true, it is phenomenal, and exciting to think of the new possibilities it unlocks.
Wouldn't that lose deleted/moderated comments?
There is also flexibility in what you define as the dataset. Skinnier but more focused tables could save space versus one wide table that covers everything, which will probably break up compressible runs of data.
The bigger concern is how large the git history is going to get on the repository.
This is likely to be lower traffic, and the history should (?) scale only linearly with new data, so likely not the worst thing. But it's something to be cognizant of when using SCM software in unexpected ways!
I have a similar project right now where I am scraping a dataset that is only ever offering the current state. I am trying to preserve the history of this dataset and was thinking of using the same strategy. If anyone has experience or pointers in how to best add time as a dimension to an existing generic dataset, I'd love to read about it.
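One common approach (my suggestion, not something from this thread) is append-only snapshots: stamp every scraped row with an observed_at time, never overwrite, and derive "current state" as latest-per-key. A minimal sketch:

```python
def latest_per_key(rows):
    # rows: (key, observed_at, value) tuples appended on every scrape.
    # Full history is preserved; current state is the newest row per key.
    current = {}
    for key, observed_at, value in sorted(rows, key=lambda r: r[1]):
        current[key] = value
    return current

history = [
    ("item1", "2026-03-01", "draft"),
    ("item1", "2026-03-10", "published"),
    ("item2", "2026-03-05", "draft"),
]
print(latest_per_key(history))
# {'item1': 'published', 'item2': 'draft'}
```

To save space you can additionally skip appending a row when the value is unchanged from the previous snapshot, at the cost of a slightly more involved "state as of time T" query.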
So it's not really one big file getting replaced all the time. Though a less extreme variation of that is happening day to day.
If you need fresher data, let me know. I will open source the whole pipeline later.
> The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.
That’s more than 5 minutes ago by a day or two. No big deal, but a little bit depressing this is still how we do things in 2026.
So to get all the data, you need to grab the archive plus all the 5-minute update files.
archive data is here https://huggingface.co/datasets/open-index/hacker-news/tree/...
update files are here (I know that it's called "today", but it actually includes all the update files, which span multiple days at this point) https://huggingface.co/datasets/open-index/hacker-news/tree/...
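Combining the archive with the update files is essentially a dedup by item id where the newest version wins. A sketch (field names and the idea that later files override earlier ones are assumptions on my part):

```python
def merge_items(archive, updates):
    # Later files override earlier ones for the same id, so edits,
    # deletions, and score changes in the update files take effect.
    merged = {item["id"]: item for item in archive}
    for item in updates:
        merged[item["id"]] = item
    return list(merged.values())

archive = [{"id": 1, "score": 10}, {"id": 2, "score": 5}]
updates = [{"id": 2, "score": 7}, {"id": 3, "score": 1}]
print(merge_items(archive, updates))
# id 2 now has score 7, and id 3 is appended
```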
They are suggesting that the Hugging Face description should automatically update the date and item count when the data gets updated.
Comments+posts are defined as user generated content, you have no right to its privacy/control in any capacity once you post it - https://www.ycombinator.com/legal/
YC in theory has the right to go after unauthorized 3rd parties scraping this data. But YC funds startups and is deeply invested in the AI space. Why on Earth would they do that?
Copyright doesn't seem to matter unless you're an IP cartel or mega cap.
https://www.ycombinator.com/legal/
Mods, enforce your license terms, you're playing fast and loose with the law (GDPR/CPRA)
The user content is supposed to be licensed only to Y Combinator and (bleah) its affiliated companies (which are many: all the startups they fund, for example).
Your submissions to, and comments you make on, the Hacker News site are not Personal Information and are not "HN Information" as defined in this Privacy Policy.
Other Users: certain actions you take may be visible to other users of the Services.
Then again, I'm not the guy that is going to get sued...
I agree. It's the owners of the sites that have to follow rules, not us.
And that, my friends, is how you kill the commons - by ignoring the social context surrounding its maintenance and insisting upon the most punitive ways of avoiding abuse.
I know, because I've been here since maybe 2015 or so, but this account was created in 2019.
So any PII you have mentioned in your comments is permanent on Hacker News.
I would appreciate it if they gave users the ability to remove all of their personal data, but in correspondence and in writing here on Hacker News itself, Dan has suggested that they value the posterity of conversations over the law.
https://www.ycombinator.com/legal/
See: User Content Transmitted Through the Site
That makes the implicit TOS agreement legally confusing depending on jurisdiction.
(Not that it really matters, but I find these technicalities amusing)