One thing that stood out to me was the graph of sentiment analysis over time; I hadn't seen something like that before, and it was interesting to see it for Rust. What were the most positive topics over time? And were there topics that saw very sudden drops?
I also found this sentence interesting, as it rings true to me about social media: "there seems to be a lot of negative sentiment on HN in general." It would be cool to see a comparison of sentiment across social media platforms and across time!
The negative sentiment stood out to me mostly because I was expecting a more "clear-cut" sentiment graph: largely neutral-positive, with spikes in the positive direction around positive posts and negative around negative posts. However, for almost all my queries, the sentiment was almost always negative. Even positive posts apparently attracted a lot of negativity (according to the model and my approach, both of which could be wrong). It's something I'd like to dive deeper into, perhaps in a future blog post.
Essentially, I'm not familiar with HuggingFace or any models in this regard. But if they are trained on social media data, then the analysis seems skewed from the start to me.
Also, fully aware that this comment will probably be viewed as negative based on stated assumptions.
edit: reading further down the comments, clearly I'm not the first with these sentiments.
Posts written in sweet syrupy tones wouldn’t do well here, and jokes are in short supply or outright banned. Most people here also seem to be men. There’s always someone shooting you down. And after a while, you start to shoot back.
On HN, my theory is that positivity is the upvotes, and negativity/criticality is the discussion.
Personally, my contribution to your effort is that I would love to see a tool that could do this analysis for me over a dataset/corpus of my choosing. The code is nice, but it is a bit beyond me to follow in your footsteps.
Also time zones and weekday/weekend.
My unsupported take is that engineers are mostly critical, but will +1 positive feedback instead of repeating it, as they might for criticism :)
This may be a personal style difference, but I find HN to be the least toxic of all social media I’ve tried. LinkedIn would be my example of ultra toxicity – the aggressive positivity there is unbearable. At least on HN people tell you what they think, and even use a constructive, decently argued approach to doing so.
HN to me feels like a good technical discussion where people tear apart ideas instead of each other.
But yeah if you put a lot of ego into your ideas, HN must be an awful place to visit.
It may be a cultural thing, but I think many people see negative sentiment as a constructive tool and a demonstration of trust and respect among people who recognize each other as robust and capable peers.
Avoiding it is something you do with people who you believe need special delicacy: whether because they've told you so, because they intimidate you, or because you sense something pitiable and fragile about them.
If you can trust that it's given in good faith, and by the guidelines of HN you are asked to, negative sentiment should be seen as an expression that someone thinks you're a fully capable adult and peer. Personally, I deeply appreciate that it's generally so comfortably shared and received here and would never include "toxicity" in one of my critiques of HN.
It's a surprising thing to read someone say!
(Unless you're thinking of the nastiness that can surface on flamewar topics, but there are numerous means by which those get downranked and displaced, and they're otherwise sparse and easy to avoid.)
I'd suggest using HDBScan to generate hierarchical clusters for the points, then use a model to generate names for interior clusters. That'll make it easy to explore topics out to the leaves, as you can just pop up refinements based on the connectivity to the current node using the summary names.
The groups need more distinct coloring, which I think having clusters could help with. The individual article text size should depend on how important or relevant the article is, either in general or based on the current search. If you had more interior cluster summaries, that'd also help cut down on some of the text clutter, as you could replace multiple posts with a group summary until the user zooms in further.
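For what it's worth, a rough sketch of what that pipeline could look like in Python, assuming the umap-learn and hdbscan packages and that `embeddings` / `titles` already exist; the naming of interior clusters is only stubbed out:

  # Sketch of a UMAP -> HDBSCAN -> label pipeline (not the OP's actual code).
  import numpy as np
  import umap
  import hdbscan

  def cluster_and_label(embeddings: np.ndarray, titles: list[str]):
      # Reduce dimensionality first; HDBSCAN struggles in very high dimensions.
      reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

      clusterer = hdbscan.HDBSCAN(min_cluster_size=25)
      labels = clusterer.fit_predict(reduced)  # -1 means "noise"
      # The full cluster hierarchy is available via clusterer.condensed_tree_
      # if you want interior nodes rather than just the leaves.

      clusters = {}
      for cluster_id in sorted(set(labels)):
          if cluster_id == -1:
              continue
          members = np.where(labels == cluster_id)[0]
          # A few member titles closest to the centroid make a decent prompt
          # for whatever model you use to generate the cluster's name.
          centroid = reduced[members].mean(axis=0)
          dists = np.linalg.norm(reduced[members] - centroid, axis=1)
          clusters[cluster_id] = [titles[members[i]] for i in np.argsort(dists)[:5]]
      return labels, clusters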
I must say though, whether it's jina or bge-3/flag, the embeddings (and tokenizer?) do not do a good job on tech topics. They're fine for natural words, but searching for tech concepts like "xaml", "simd", etc. causes them to fall back to tokenizing the inputs and grabbing similar-sounding words.
Also, just some constructive feedback: it would be nice if there were some way to stop it from showing the same "hn leaderboard" of results when there are no real matches because a topic is too niche. I get a lot of "Stephen Hawking has died" when searching for words the embeddings aren't familiar with.
Edit: I'm not so sure how well the sentiment analysis is working. I had the feeling that there was too much negative sentiment that didn't match up to reality, so I tried looking up things HN would feel overwhelmingly positive about, like "Mr Rogers"; I mean, who could feel negatively about him? The results show some serious negative spikes. Look up "Carter" and there's a massive negative peak associated with the passing of Rosalynn Carter. It was an HN submission talking about all the wonderful things the Carters did.
Also, I think the "popularity over time" needs to be scaled by the median number of votes a story got that month/year, because the trend lines just go up and up if you plot strictly the number of posts. Look at the popularity of "diesel" and you'll see what I mean: this is a term that peaked ten years ago! Or perhaps it should be some sort of keyword incidence rate, or the number of items within a cosine distance of x from the query, rather than post score?
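To make the normalization idea concrete, here's a small hypothetical pandas sketch (the column names `month`, `score`, and `matches_query` are made up, not the OP's schema):

  import pandas as pd

  def popularity_over_time(posts: pd.DataFrame) -> pd.Series:
      # Baseline per month: the median story score that month.
      monthly_baseline = posts.groupby("month")["score"].median()
      # Topic signal per month: total score of posts matching the query.
      monthly_topic = posts[posts["matches_query"]].groupby("month")["score"].sum()
      # Scale by the baseline so later, busier years don't dominate the trend line.
      return (monthly_topic / monthly_baseline).fillna(0)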
Edit2: The dynamic "click a post to remove and recalculate similarity threshold" is awesome.
Obviously the scale of OP's project adds a lot of interesting complexity that this tool cannot handle, but it's great for medium-sized datasets.
I define self promotion on HN as a "Show HN: I ..." post vs "Show HN: Something ..."
Examples from the top 100 right now
* "Show HN: Exploring HN by mapping and analyzing 40M posts and comments for fun"
* "Show HN: Browser-based knitting (pattern) software"
These are not self promotional titles. The subjects are the exploration and the software respectively.
* "Show HN: I built a non-linear UI for ChatGPT"
* "Show HN: I created 3,800+ Open Source React Icons"
These are self promotional titles. The subject of each is "I".
My own simple check, using Algolia search results for titles that start with "Show HN: I", gave these results for years starting April 1st, graphed divided by the total number of results for that year (a rough sketch of the query follows the chart below):
2023 ****************************************
2022 ***********************************
2021 ***************************
2020 **************************************
2019 *************************
2018 *************
2017 *******
2016 **********
2015 ********
2014 ************
2013 *********************
2012 *****************
2011 *********
2010 ***
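For anyone wanting to reproduce this, here's a rough Python sketch of the kind of Algolia query I mean (not exactly my original check; the counting and pagination details may need tuning):

  # Share of "Show HN" stories per year whose title literally starts with "Show HN: I",
  # using the public Algolia HN Search API.
  import datetime as dt
  import requests

  def show_hn_i_share(year: int) -> float:
      start = int(dt.datetime(year, 4, 1).timestamp())
      end = int(dt.datetime(year + 1, 4, 1).timestamp())
      params = {
          "tags": "show_hn,story",
          "numericFilters": f"created_at_i>={start},created_at_i<{end}",
          "hitsPerPage": 1000,
      }
      titles, page = [], 0
      while True:
          r = requests.get("https://hn.algolia.com/api/v1/search_by_date",
                           params={**params, "page": page}).json()
          titles.extend(h["title"] for h in r["hits"])
          if page >= r["nbPages"] - 1:
              break
          page += 1
      self_promo = sum(1 for t in titles if t.startswith("Show HN: I"))
      return self_promo / max(len(titles), 1)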
I feel like maybe I grew up in a time when, generally, self promotion was considered a bad character trait. Your actions are supposed to be what promotes you; calling attention to them yourself is not. But I feel that culture is changing. I wonder if the rise in self promotion (assuming there is a rise) has to do with social media, etc.
I perceive a similar rise on YouTube, but I have no data, just a feeling from the number of YouTube recommendations for videos of "I....."
So what you consider to be self promotion vs non-self-promotion, I consider to be self promotion with a title that very clearly indicates that vs self promotion with a title that less clearly indicates that. However, the "Show HN" phrase is only used for self promotion I think, so even without the "I", anyone familiar with the convention will know it's self promotion.
I think that's an extremely cynical view, though a common one. I've never thought of "Show HN" as self promotion if it doesn't include "I", unless I go through to the actual product/library/post and find it full of self promotion. I agree with you that a post that doesn't include "I" can be self promotion, but I don't think it always is, even if the person made/worked on it.
"Show HN: XYZ, an LLM library in rust" to me is informational. Its point is, more often than not, to inform people of something they might get use out of. I know that's true when I've posted something like that. Its meaning is "here's a useful resource that was just created". Sure, I get pleasure from knowing I helped people with something, but I'm not trying to promote myself; I'm trying to promote the library/post/info.
"Show HN: I made an LLM Library in rust" to me is self promotional. It might be useful to others, but its intent was clearly self promotion given the subject is "I", not the library/post/product.
They are all “look, I made something cool, what do you think?”
(That's not a dig, I think it's a good idea.)
2) Are you embedding the search phrases word by word? And using the same model as the documents used? Because I searched for "lead generation", which any decent non-unigram embedding should understand, but I got results for lead poisoning.
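To illustrate the difference I'm asking about, a small hypothetical sketch with sentence-transformers (the model name is just an example, not necessarily what the OP used):

  # Whole-phrase embedding vs. averaging per-word embeddings for the same query.
  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer("all-MiniLM-L6-v2")

  query = "lead generation"
  docs = ["Tips for B2B lead generation and sales outreach",
          "Study finds lead poisoning risk in old water pipes"]

  whole_phrase = model.encode(query)                       # one vector for the phrase
  word_by_word = model.encode(query.split()).mean(axis=0)  # average of per-word vectors

  for name, q in [("phrase", whole_phrase), ("word-avg", word_by_word)]:
      print(name, util.cos_sim(q, model.encode(docs)))

A unigram average tends to blur "lead" (the metal) and "lead" (the prospect), which would explain the lead-poisoning results.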
The downside is that the implementation in the Python UMAP package isn't great and creates/pushes the whole expanded node/edge dataset to the GPU, which means you can only train it on about 100k embeddings before going OOM.
The UMAP -> HDBSCAN -> AI cluster labeling pipeline that's all unsupervised is so useful that I'm tempted to figure out a more scalable implementation of Parametric UMAP.
EDIT: looking through the docs, it's just GPU-accelerated UMAP, not a parametric UMAP which trains a NN model. That's easy to work around though by training a new NN model to predict the reduced dimensionality values and minimizing RMSE.
There have been attempts at a PyTorch port of Parametric UMAP (https://github.com/lmcinnes/umap/issues/580) but nothing as good.
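A rough sketch of that workaround, assuming scikit-learn and umap-learn and an in-memory `embeddings` array (this is just a regressor imitating a fitted UMAP, not Parametric UMAP itself):

  # Fit UMAP on a subsample that fits in memory, then train a small regressor to
  # imitate it, so the projection can be applied batch-wise to the full dataset.
  import numpy as np
  import umap
  from sklearn.neural_network import MLPRegressor

  def fit_parametric_like_umap(embeddings: np.ndarray, sample_size: int = 100_000):
      idx = np.random.choice(len(embeddings),
                             size=min(sample_size, len(embeddings)), replace=False)
      sample = embeddings[idx]
      coords_2d = umap.UMAP(n_components=2, metric="cosine").fit_transform(sample)

      # Squared-error loss on the 2D coordinates, i.e. the RMSE idea above.
      reg = MLPRegressor(hidden_layer_sizes=(256, 64), max_iter=200)
      reg.fit(sample, coords_2d)
      return reg  # reg.predict(batch) then projects any remaining embeddings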
They used 150 GPUs and developed two custom systems (db-rpc and queued) for inter-server communication, and this was just to compute the embeddings; there's a lot of other work and computation surrounding it.
I'm curious about the context of the project, and how someone gets this kind of funding and time for such research.
PS: Having done a lot of similar work professionally (mapping academic paper and patent landscapes), I'm not sure if 150 GPUs were really needed. If you end up just projecting to 2D and clustering, I think that traditional methods like bag-of-words and/or topic modelling would be much easier and cheaper, and the difference in quality would be unnoticeable. You can also use author and comment-thread graphs for similar results.
Do you have any links to your work? They sound interesting and I'd like to read more about them.
But your technical skill is obvious and very impressive.
If you want to read more, my old bachelor's thesis is somewhat related, from when we only had word embeddings and document embeddings were quite experimental still: https://ad-publications.cs.uni-freiburg.de/theses/Bachelor_J...
I've done a lot of follow-up work in my startup Scitodate, which includes large-scale graph and embedding analysis, but we haven't published most of it for now.
As far as funding/time, one possibility is that they are between endeavors/employment and it's self-funded, as they have had a financially successful career or business. They were very efficient with the GPU utilization, so it probably didn't cost that much.
(2) I apply classical ML (say, a probability-calibrated SVM) to embeddings like that and get good results for classification and clustering, over 100x faster than fine-tuning an LLM.
My HN recommender works fine just using decision trees and XGBoost FWIW. I'll bet SVM would work great too.
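For reference, a minimal sketch of the probability-calibrated-SVM-on-embeddings idea with scikit-learn (assuming `X` is an array of precomputed document embeddings and `y` the labels; not anyone's production setup):

  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.svm import LinearSVC

  def train_calibrated_svm(X, y):
      # LinearSVC has no predict_proba; wrapping it in CalibratedClassifierCV
      # adds calibrated probability estimates on top of the SVM decision function.
      clf = CalibratedClassifierCV(LinearSVC(C=1.0), method="sigmoid", cv=3)
      clf.fit(X, y)
      return clf  # clf.predict_proba(new_embeddings) gives calibrated probabilities

Because the embeddings do the heavy lifting, the classifier itself trains in seconds even on large corpora.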
Typically the vectors are normalized, instead of what's shown in this demonstration.
When using normalized vectors, the Euclidean distance measures the distance between the endpoints of the two vectors, while the cosine similarity measures the length of one vector projected onto the other.
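To spell out the connection for unit vectors u and v:

  \|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\,u \cdot v = 2(1 - \cos\theta)

So on normalized vectors the two measures are monotonically related and produce the same nearest-neighbour ranking.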
I set out with similar hypotheses and goals like you (on a slightly different scale though, haha) but I've been completely stuck on the interactive map part. Definitely getting a lot of pointers from how you handled this!
Maybe one key difference in approach is that I've put more emphasis on trying to extract key topics as keywords.
For ex:
article (title): "Useful Uses of cat"
keywords: ['Software design', 'Contraction', 'Code changes', 'Modularity', 'Ease of extension']
My hypothesis is that this will be a faster search solution than using the embeddings, but potentially not as accurate. I'm not far enough along to really prove this yet, though.
Would love to hear what you think! Any other cool ideas on what could be done with the keywords? I explain my process a bit more here if interested: https://hackernews-demo.streamlit.app/#data-aggregation-meth...
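In case it helps picture it, a toy sketch of the kind of keyword lookup I mean (all names hypothetical, nothing to do with the OP's code):

  # Build an inverted index from extracted keywords to article ids, then answer
  # a query by set union/intersection instead of a full vector scan.
  from collections import defaultdict

  def build_index(articles: dict[int, list[str]]) -> dict[str, set[int]]:
      index = defaultdict(set)
      for article_id, keywords in articles.items():
          for kw in keywords:
              index[kw.lower()].add(article_id)
      return index

  def search(index: dict[str, set[int]], query_keywords: list[str]) -> set[int]:
      # Union of matching articles; an intersection would be stricter.
      result = set()
      for kw in query_keywords:
          result |= index.get(kw.lower(), set())
      return result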
Is it possible to keep it up to date?
Table definition:
CREATE TABLE hackernews_history
(
update_time DateTime DEFAULT now(),
id UInt32,
deleted UInt8,
type Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
by LowCardinality(String),
time DateTime,
text String,
dead UInt8,
parent UInt32,
poll UInt32,
kids Array(UInt32),
url String,
score Int32,
title String,
parts Array(UInt32),
descendants Int32
)
ENGINE = ReplacingMergeTree(update_time) ORDER BY id;
A shell script:
BATCH_SIZE=1000
TWEAKS="--optimize_trivial_insert_select 0 --http_skip_not_found_url_for_globs 1 --http_make_head_request 0 --engine_url_skip_empty_files 1 --http_max_tries 10 --max_download_threads 1 --max_threads $BATCH_SIZE"
rm -f maxitem.json
wget --no-verbose https://hacker-news.firebaseio.com/v0/maxitem.json
clickhouse-local --query "
SELECT arrayStringConcat(groupArray(toString(number)), ',') FROM numbers(1, $(cat maxitem.json))
GROUP BY number DIV ${BATCH_SIZE} ORDER BY any(number) DESC" |
while read ITEMS
do
echo $ITEMS
clickhouse-client $TWEAKS --query "
INSERT INTO hackernews_history SELECT * FROM url('https://hacker-news.firebaseio.com/v0/item/{$ITEMS}.json')"
done
It takes a few hours to download the data and fill the table.
I saw the recursive CTE blog post, but this doesn't seem to work on your HN dataset:
https://play.clickhouse.com/play?user=play#V0lUSCBSRUNVUlNJV...
Are recursive CTEs disabled on this instance, or am I doing something wrong?
<Trace> ReadWriteBufferFromHTTP: Failed to make request to 'https://hacker-news.firebaseio.com/v0/item/40298680.json'. Error: Timeout: connect timed out: 216.239.32.107:443. Failed at try 3/10. Will retry with current backoff wait is 200/10000 ms.
I googled with no luck. I was wondering if you have a solution for it.
https://github.com/wilsonzlin/hackerverse/releases/tag/datas...
The comments text table is 13 GB, to give you an idea. Can definitely be processed on a laptop.
Feedback: on my iOS phone, once you select a dot on the map, there is no way to unselect it. The preview card of some articles takes the full screen, so I can't even click to another dot. Maybe add a "cross" icon to the preview card, or make it so that tapping outside the card hides the whole card strip?
Really neat work
edit: Also had no idea HN went back to 2006. https://news.ycombinator.com/item?id=1
edit2: PG wrote this? https://news.ycombinator.com/item?id=487171
Still an impressive number
Using it, I stumbled upon [1], which reflects your comments on comment sentiment.
This also reminded me of [2] (for which the site itself had rotted away, incidentally) - analysing HN users' similarity by writing style.
[1] https://minimaxir.com/2014/10/hn-comments-about-comments/ [2] https://news.ycombinator.com/item?id=33755016
Thanks for sharing that article, it was an interesting read. It was cool how deep the analysis went with a few simple statistical methods.
Any idea why password reuse is so far away from security? That was the only oddity of the map for me.
It would be cool to have analogous continents, countries, sub-regions, roads, different-sized settlements, and significant landmarks... This version looks great at the highest zoom level, but rapidly becomes hard to interpret as you zoom in, same as most similar large embedding or graph visualizations.
Once you register on iOS, you can also log in through the web app: https://hn.garglet.com
Probably not ready for a Hacker News hug of death yet, but you can try.
Possibly the greatest indicator of social startup success.
Compare topics/sentiment etc. by number of users and by number of posts.
Are some topics dominated by a few prolific posters? Positively or negatively.
Also, how does one separate negative/positive sentiment into criticism/advocacy?
How hard is it to detect positive criticism, or enthusiastic endorsement of an acknowledged bad thing?
What you've built is really impressive. I'm excited to see where this goes!
What I would like to figure out is the easiest way to go from the API straight into a parquet file.
As for the Arrow file, I'm not sure unfortunately. I imagine there are some difficulties because the format is columnar, so it probably wants a batch of rows (when writing) instead of one item at a time.
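One possible approach, sketched with pyarrow (the field list and batch size are arbitrary, and this isn't the OP's pipeline): buffer items into row batches and flush each batch to a ParquetWriter.

  # Stream HN items from the API into a Parquet file in row batches, since
  # Parquet/Arrow writers want chunks of rows rather than one item at a time.
  import pyarrow as pa
  import pyarrow.parquet as pq
  import requests

  schema = pa.schema([
      ("id", pa.int64()),
      ("type", pa.string()),
      ("by", pa.string()),
      ("time", pa.int64()),
      ("score", pa.int64()),
      ("title", pa.string()),
      ("url", pa.string()),
      ("text", pa.string()),
  ])

  def dump_items(start_id: int, end_id: int, path: str, batch_size: int = 1000):
      with pq.ParquetWriter(path, schema) as writer:
          rows = []
          for item_id in range(start_id, end_id):
              item = requests.get(
                  f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json").json()
              if item:
                  rows.append({name: item.get(name) for name in schema.names})
              if len(rows) >= batch_size:
                  writer.write_table(pa.Table.from_pylist(rows, schema=schema))
                  rows = []
          if rows:  # flush the final partial batch
              writer.write_table(pa.Table.from_pylist(rows, schema=schema))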
I trained a model to predict whether a given post will reach the front page, get flagged, etc. I collected over 1,000 RSS feeds and rank the RSS entries with my ranking models.
I submit the high-ranking entries to HN to test out my models, and I can reach the front page consistently, sometimes having multiple entries on the front page at a given time.
I also experiment with user->content recommendation, for that I use comment data for modeling interactions between users and entries, which seems to work fine.
The only problem I have is that I get a lot of 'out of distribution' content in my RSS feeds, which causes my ranking models to get 'confused'. For this, I trained models to predict whether a given entry belongs on HN or not. On top of that, I have some tagging models trained on data I scraped from lobste.rs and hand-annotated.
I've been working on this on and off for the last 2 years or so. This account is not my main, just one I created for testing.
AMA
i.e. do HN users upvote more based on the title of an article or on actually reading it?
Feature request: is it possible to show in the graph how popular the topic/sub-topic/article is?
So that we can do an educated exploration in the graph around what was upvoted and what was not?
Am I out of touch?
Gonna dig more into it.
Exemplary Show HN! We need more of this.
https://files.casmconsulting.co.uk/message-based-community-d...
If you would like to read about my largely unsuccessful recreation of the paper, you can do so here - https://dfworks.xyz/blog/partygate/
Turns out, there's only 1 post so far on his blog.
Hoping for more! This one is great.
I'm not in the ML/AI arena yet, so I couldn't fully understand the second half of the article except for having a general idea about Embeddings and their potential, but the first part is what interests me as a software engineer.
Following are some of the challenges the author came across. They were able to overcome each of them and published the full source code.
Downloading HN database
> There's also a maxitem.json API, which gives the largest ID. As of this writing, the max item ID is over 40 million. Even with a very nice and low 10 ms mean response time, this would take over 4 days to crawl, so we need some parallelism.
> I've exported the HN crawler [1] (in TypeScript) to its own project, if you're ever in need to fetch HN items.
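To illustrate the parallelism point (the author's real crawler, linked at [1], is TypeScript; this is just a hypothetical Python/aiohttp sketch):

  # Fetch many HN items concurrently, capping in-flight requests with a semaphore.
  import asyncio
  import aiohttp

  async def fetch_item(session, sem, item_id):
      url = f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json"
      async with sem, session.get(url) as resp:
          return await resp.json()

  async def crawl(start_id, end_id, concurrency=200):
      sem = asyncio.Semaphore(concurrency)
      async with aiohttp.ClientSession() as session:
          tasks = [fetch_item(session, sem, i) for i in range(start_id, end_id)]
          return await asyncio.gather(*tasks)

  # items = asyncio.run(crawl(40_000_000, 40_001_000))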
Fetching and parsing linked URLs' HTML for metadata and text
> For text posts and comments, the answer is simple. However, for the vast majority of link posts, this would mean crawling those pages being linked to. So I wrote up a quick Rust service [2] to fetch the URLs linked to and parse the HTML for metadata (title, picture, author, etc.) and text. This was CPU-intensive so an initial Node.js-based version was 10x slower and a Rust rewrite was worthwhile. Fortunately, other than that, it was mostly smooth and painless, likely because HN links are pretty good (responsive servers, non-pathological HTML, etc.).
Recovering missing/dead links
> A lot of content even on Hacker News suffers from the well-known link rot: around 200K resulted in a 404, DNS lookup failure, or connection timeout, which is a sizable "hole" in the dataset that would be nice to mend. Fortunately, the Internet Archive has an API that we can use to programmatically fetch archived copies of these pages. So, as a final push for a more "complete" dataset, I used the Wayback API to fetch the last few thousand articles, some dating back years, which was very annoying because IA has very, very low rate limits (around 5 per minute).
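One of the Wayback APIs usable for this is the availability endpoint; a rough sketch follows (the exact API the author used isn't specified, and `dead_links` is a stand-in for the failed URLs):

  # Look up the closest archived snapshot for each dead URL, throttled to respect
  # the Internet Archive's low rate limits mentioned above.
  import time
  import requests

  def wayback_url(dead_url: str):
      resp = requests.get("https://archive.org/wayback/available",
                          params={"url": dead_url}).json()
      snapshot = resp.get("archived_snapshots", {}).get("closest")
      return snapshot["url"] if snapshot and snapshot.get("available") else None

  dead_links = ["http://example.com/some-404-page"]  # placeholder for the ~200K failures
  recovered = {}
  for url in dead_links:
      archived = wayback_url(url)
      if archived:
          recovered[url] = archived
      time.sleep(12)  # roughly 5 requests per minute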
Finding a cost-effective cloud provider for GPUs
> Fortunately, I discovered RunPod, a provider of machines with GPUs that you can deploy your containers onto, at a cost far cheaper than major cloud providers. They also have more cost-effective GPUs like RTX 4090, while still running in datacenters with fast Internet connections. This made scaling up a price-accessible option to mitigate the inference time required.
This is the type of content that makes HN stand out from the crowd.
_____________________________
1. https://github.com/wilsonzlin/crawler-toolkit-hn/
2. https://github.com/wilsonzlin/hackerverse/tree/master/crawle...
A Peek inside HN: Analyzing ~40M stories and comments