One thing that stood out to me was the graph of sentiment analysis over time; I hadn't seen something like that before, and it was interesting to see it for Rust. What were the most positive topics over time? And were there topics that saw very sudden drops?
I also found this sentence interesting, as it rings true to me about social media: "there seems to be a lot of negative sentiment on HN in general." It would be cool to see a comparison of sentiment across social media platforms and across time!
The negative sentiment stood out to me mostly because I was expecting a more "clear-cut" sentiment graph: largely neutral-positive, with spikes in the positive direction around positive posts and negative around negative posts. However, for almost all my queries, the sentiment was almost always negative. Even positive posts apparently attracted a lot of negativity (according to the model and my approach, both of which could be wrong). It's something I'd like to dive deeper into, perhaps in a future blog post.
Essentially, I'm not familiar with HuggingFace or any models in this regard. But if they are trained on social media data, then the analysis seems skewed from the start to me.
Also, fully aware that this comment will probably be viewed as negative based on stated assumptions.
edit: reading further down the comments, clearly I'm not the first with these sentiments.
Posts written in sweet syrupy tones wouldn’t do well here, and jokes are in short supply or outright banned. Most people here also seem to be men. There’s always someone shooting you down. And after a while, you start to shoot back.
On HN, my theory is that positivity is the upvotes, and negativity/criticality is the discussion.
Personally, my contribution to your effort is that I would love to see a tool that could do this analysis for me over a dataset/corpus of my choosing. The code is nice, but it is a bit beyond me to follow in your footsteps.
Also time zones and weekday/weekend.
My unsupported take is that engineers are mostly critical, but will +1 positive feedback instead of repeating it, as they might for criticism :)
This may be a personal style difference, but I find HN to be the least toxic of all social media I’ve tried. LinkedIn would be my example of ultra toxicity – the aggressive positivity there is unbearable. At least on HN people tell you what they think, and even use a constructive, decently argued approach to doing so.
HN to me feels like a good technical discussion where people tear apart ideas instead of each other.
But yeah if you put a lot of ego into your ideas, HN must be an awful place to visit.
It may be a cultural thing, but I think many people see negative sentiment as a constructive tool and a demonstration of trust and respect among people who recognize each other as robust and capable peers.
Avoiding it is something you do with people who you believe need special delicacy: whether because they've told you so, because they intimidate you, or because you sense something pitiable and fragile about them.
If you can trust that it's given in good faith, and by the guidelines of HN you are asked to, negative sentiment should be seen as an expression that someone thinks you're a fully capable adult and peer. Personally, I deeply appreciate that it's generally so comfortably shared and received here and would never include "toxicity" in one of my critiques of HN.
It's a surprising thing to read someone say!
(Unless you're thinking of the nastiness that can surface on flamewar topics, but there are numerous means by which those get downranked and displaced, and they're otherwise sparse and easy to avoid.)
I'd suggest using HDBScan to generate hierarchical clusters for the points, then use a model to generate names for interior clusters. That'll make it easy to explore topics out to the leaves, as you can just pop up refinements based on the connectivity to the current node using the summary names.
The groups need more distinct coloring, which I think having clusters could help with. The individual article text size should depend on how important or relevant the article is, either in general or based on the current search. If you had more interior cluster summaries, that'd also help cut down on some of the text clutter, as you could replace multiple posts with a group summary until the user zooms in further.
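For what it's worth, a rough sketch of what that pipeline could look like in Python, assuming the umap-learn and hdbscan packages and that `embeddings` / `titles` already exist; the naming of interior clusters is only stubbed out:

  # Sketch of a UMAP -> HDBSCAN -> label pipeline (not the OP's actual code).
  import numpy as np
  import umap
  import hdbscan

  def cluster_and_label(embeddings: np.ndarray, titles: list[str]):
      # Reduce dimensionality first; HDBSCAN struggles in very high dimensions.
      reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

      clusterer = hdbscan.HDBSCAN(min_cluster_size=25)
      labels = clusterer.fit_predict(reduced)  # -1 means "noise"
      # The full cluster hierarchy is available via clusterer.condensed_tree_
      # if you want interior nodes rather than just the leaves.

      clusters = {}
      for cluster_id in sorted(set(labels)):
          if cluster_id == -1:
              continue
          members = np.where(labels == cluster_id)[0]
          # A few member titles closest to the centroid make a decent prompt
          # for whatever model you use to generate the cluster's name.
          centroid = reduced[members].mean(axis=0)
          dists = np.linalg.norm(reduced[members] - centroid, axis=1)
          clusters[cluster_id] = [titles[members[i]] for i in np.argsort(dists)[:5]]
      return labels, clusters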
I must say though, whether it's jina or bge-3/flag, the embeddings (and tokenizer?) do not do a good job on tech topics. They're fine for natural words, but searching for tech concepts like "xaml", "simd", etc. causes them to fall back to tokenizing the inputs and grabbing similar-sounding words.
Also, just some constructive feedback: it would be nice if there were some way to stop it from showing the same "hn leaderboard" of results when there are no real matches because a topic is too niche. I get a lot of "Stephen Hawking has died" when searching for words the embeddings aren't familiar with.
Edit: I'm not so sure how well the sentiment analysis is working. I had the feeling that there was too much negative sentiment that didn't match up to reality, so I tried looking up things HN would feel overwhelmingly positive about, like "Mr Rogers"; I mean, who could feel negatively about him? The results show some serious negative spikes. Look up "Carter" and there's a massive negative peak associated with the passing of Rosalynn Carter. It was an HN submission talking about all the wonderful things the Carters did.
Also, I think the "popularity over time" needs to be scaled by the median number of votes a story got that month/year, because the trend lines just go up and up if you plot strictly the number of posts. Look at the popularity of "diesel" and you'll see what I mean: this is a term that peaked ten years ago! Or perhaps it should be some sort of keyword incidence rate, or the number of items within a cosine distance of x from the query, rather than post score?
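To make the normalization idea concrete, here's a small hypothetical pandas sketch (the column names `month`, `score`, and `matches_query` are made up, not the OP's schema):

  import pandas as pd

  def popularity_over_time(posts: pd.DataFrame) -> pd.Series:
      # Baseline per month: the median story score that month.
      monthly_baseline = posts.groupby("month")["score"].median()
      # Topic signal per month: total score of posts matching the query.
      monthly_topic = posts[posts["matches_query"]].groupby("month")["score"].sum()
      # Scale by the baseline so later, busier years don't dominate the trend line.
      return (monthly_topic / monthly_baseline).fillna(0)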
Edit2: The dynamic "click a post to remove and recalculate similarity threshold" is awesome.
Obviously the scale of OP's project adds a lot of interesting complexity that this tool cannot handle, but it's great for medium-sized datasets.
I define self promotion on HN as a "Show HN: I ..." post vs "Show HN: Something ..."
Examples from the top 100 right now
* "Show HN: Exploring HN by mapping and analyzing 40M posts and comments for fun"
* "Show HN: Browser-based knitting (pattern) software"
These are not self promotional titles. The subjects are the exploration and the software respectively.
* "Show HN: I built a non-linear UI for ChatGPT"
* "Show HN: I created 3,800+ Open Source React Icons"
These are self promotional titles. The subject of each is "I".
My own simple check, using Algolia search results for titles that start with "Show HN: I", gave these results for years starting April 1st, graphed divided by the total number of results for that year (a rough sketch of the query follows the chart below):
2023 ****************************************
2022 ***********************************
2021 ***************************
2020 **************************************
2019 *************************
2018 *************
2017 *******
2016 **********
2015 ********
2014 ************
2013 *********************
2012 *****************
2011 *********
2010 ***
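For anyone wanting to reproduce this, here's a rough Python sketch of the kind of Algolia query I mean (not exactly my original check; the counting and pagination details may need tuning):

  # Share of "Show HN" stories per year whose title literally starts with "Show HN: I",
  # using the public Algolia HN Search API.
  import datetime as dt
  import requests

  def show_hn_i_share(year: int) -> float:
      start = int(dt.datetime(year, 4, 1).timestamp())
      end = int(dt.datetime(year + 1, 4, 1).timestamp())
      params = {
          "tags": "show_hn,story",
          "numericFilters": f"created_at_i>={start},created_at_i<{end}",
          "hitsPerPage": 1000,
      }
      titles, page = [], 0
      while True:
          r = requests.get("https://hn.algolia.com/api/v1/search_by_date",
                           params={**params, "page": page}).json()
          titles.extend(h["title"] for h in r["hits"])
          if page >= r["nbPages"] - 1:
              break
          page += 1
      self_promo = sum(1 for t in titles if t.startswith("Show HN: I"))
      return self_promo / max(len(titles), 1)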
I feel like maybe I grew up in a time when, generally, self promotion was considered a bad character trait. Your actions are supposed to be what promotes you; calling attention to them yourself is not. But I feel that culture is changing. I wonder if the rise in self promotion (assuming there is a rise) has to do with social media, etc.
I perceive a similar rise on YouTube, but I have no data, just a feeling from the number of YouTube recommendations for videos of "I....."
So what you consider to be self promotion vs non-self-promotion, I consider to be self promotion with a title that very clearly indicates that vs self promotion with a title that less clearly indicates that. However, the "Show HN" phrase is only used for self promotion I think, so even without the "I", anyone familiar with the convention will know it's self promotion.
I think that's an extremely cynical view, though a common one. I've never thought of "Show HN" as self promotion if it doesn't include "I", unless I go through to the actual product/library/post and find it full of self promotion. I agree with you that a post that doesn't include "I" can be self promotion, but I don't think it always is, even if the person made/worked on it.
"Show HN: XYZ, an LLM library in rust" to me is informational. Its point is, more often than not, to inform people of something they might get use out of. I know that's true when I've posted something like that. Its meaning is "here's a useful resource that was just created". Sure, I get pleasure from knowing I helped people with something, but I'm not trying to promote myself; I'm trying to promote the library/post/info.
"Show HN: I made an LLM Library in rust" to me is self promotional. It might be useful to others, but its intent was clearly self promotion given the subject is "I", not the library/post/product.
They are all “look, I made something cool, what do you think?”
(That's not a dig, I think it's a good idea.)
2) Are you embedding the search phrases word by word? And using the same model as the documents used? Because I searched for "lead generation", which any decent non-unigram embedding should understand, but I got results for lead poisoning.
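To illustrate the difference I'm asking about, a small hypothetical sketch with sentence-transformers (the model name is just an example, not necessarily what the OP used):

  # Whole-phrase embedding vs. averaging per-word embeddings for the same query.
  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer("all-MiniLM-L6-v2")

  query = "lead generation"
  docs = ["Tips for B2B lead generation and sales outreach",
          "Study finds lead poisoning risk in old water pipes"]

  whole_phrase = model.encode(query)                       # one vector for the phrase
  word_by_word = model.encode(query.split()).mean(axis=0)  # average of per-word vectors

  for name, q in [("phrase", whole_phrase), ("word-avg", word_by_word)]:
      print(name, util.cos_sim(q, model.encode(docs)))

A unigram average tends to blur "lead" (the metal) and "lead" (the prospect), which would explain the lead-poisoning results.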
The downside is that the implementation in the Python UMAP package isn't great and creates/pushes the whole expanded node/edge dataset to the GPU, which means you can only train it on about 100k embeddings before going OOM.
The UMAP -> HDBSCAN -> AI cluster labeling pipeline that's all unsupervised is so useful that I'm tempted to figure out a more scalable implementation of Parametric UMAP.
EDIT: looking through the docs, it's just GPU-accelerated UMAP, not a parametric UMAP which trains a NN model. That's easy to work around though by training a new NN model to predict the reduced dimensionality values and minimizing RMSE.
There have been attempts at a PyTorch port of Parametric UMAP (https://github.com/lmcinnes/umap/issues/580) but nothing as good.
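A rough sketch of that workaround, assuming scikit-learn and umap-learn and an in-memory `embeddings` array (this is just a regressor imitating a fitted UMAP, not Parametric UMAP itself):

  # Fit UMAP on a subsample that fits in memory, then train a small regressor to
  # imitate it, so the projection can be applied batch-wise to the full dataset.
  import numpy as np
  import umap
  from sklearn.neural_network import MLPRegressor

  def fit_parametric_like_umap(embeddings: np.ndarray, sample_size: int = 100_000):
      idx = np.random.choice(len(embeddings),
                             size=min(sample_size, len(embeddings)), replace=False)
      sample = embeddings[idx]
      coords_2d = umap.UMAP(n_components=2, metric="cosine").fit_transform(sample)

      # Squared-error loss on the 2D coordinates, i.e. the RMSE idea above.
      reg = MLPRegressor(hidden_layer_sizes=(256, 64), max_iter=200)
      reg.fit(sample, coords_2d)
      return reg  # reg.predict(batch) then projects any remaining embeddings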
They used 150 GPUs and developed two custom systems (db-rpc and queued) for inter-server communication, and this was just to compute the embeddings; there's a lot of other work and computation surrounding it.
I'm curious about the context of the project, and how someone gets this kind of funding and time for such research.
PS: Having done a lot of similar work professionally (mapping academic paper and patent landscapes), I'm not sure if 150 GPUs were really needed. If you end up just projecting to 2D and clustering, I think that traditional methods like bag-of-words and/or topic modelling would be much easier and cheaper, and the difference in quality would be unnoticeable. You can also use author and comment-thread graphs for similar results.
Do you have any links to your work? They sound interesting and I'd like to read more about them.
But your technical skill is obvious and very impressive.
If you want to read more, my old bachelor's thesis is somewhat related, from when we only had word embeddings and document embeddings were quite experimental still: https://ad-publications.cs.uni-freiburg.de/theses/Bachelor_J...
I've done a lot of follow-up work in my startup Scitodate, which includes large-scale graph and embedding analysis, but we haven't published most of it for now.
As far as funding/time, one possibility is that they are between endeavors/employment and it's self-funded, as they have had a financially successful career or business. They were very efficient with the GPU utilization, so it probably didn't cost that much.
(2) I apply classical ML (say, a probability-calibrated SVM) to embeddings like that and get good results for classification and clustering, over 100x faster than fine-tuning an LLM.
My HN recommender works fine just using decision trees and XGBoost FWIW. I'll bet SVM would work great too.
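For reference, a minimal sketch of the probability-calibrated-SVM-on-embeddings idea with scikit-learn (assuming `X` is an array of precomputed document embeddings and `y` the labels; not anyone's production setup):

  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.svm import LinearSVC

  def train_calibrated_svm(X, y):
      # LinearSVC has no predict_proba; wrapping it in CalibratedClassifierCV
      # adds calibrated probability estimates on top of the SVM decision function.
      clf = CalibratedClassifierCV(LinearSVC(C=1.0), method="sigmoid", cv=3)
      clf.fit(X, y)
      return clf  # clf.predict_proba(new_embeddings) gives calibrated probabilities

Because the embeddings do the heavy lifting, the classifier itself trains in seconds even on large corpora.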
Typically the vectors are normalized, instead of what's shown in this demonstration.
When using normalized vectors, the Euclidean distance measures the distance between the endpoints of the two vectors, while the cosine similarity measures the length of one vector projected onto the other.
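To spell out the connection for unit vectors u and v:

  \|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\,u \cdot v = 2(1 - \cos\theta)

So on normalized vectors the two measures are monotonically related and produce the same nearest-neighbour ranking.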
I set out with similar hypotheses and goals like you (on a slightly different scale though, haha) but I've been completely stuck on the interactive map part. Definitely getting a lot of pointers from how you handled this!
Maybe one key difference in approach is that I've put more emphasis on trying to extract key topics as keywords.
For ex:
article (title): "Useful Uses of cat"
keywords: ['Software design', 'Contraction', 'Code changes', 'Modularity', 'Ease of extension']
My hypothesis is that this will be a faster search solution than using the embeddings, but potentially not as accurate. I'm not far enough along to really prove this yet, though.
Would love to hear what you think! Any other cool ideas on what could be done with the keywords? I explain my process a bit more here if interested: https://hackernews-demo.streamlit.app/#data-aggregation-meth...
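In case it helps picture it, a toy sketch of the kind of keyword lookup I mean (all names hypothetical, nothing to do with the OP's code):

  # Build an inverted index from extracted keywords to article ids, then answer
  # a query by set union/intersection instead of a full vector scan.
  from collections import defaultdict

  def build_index(articles: dict[int, list[str]]) -> dict[str, set[int]]:
      index = defaultdict(set)
      for article_id, keywords in articles.items():
          for kw in keywords:
              index[kw.lower()].add(article_id)
      return index

  def search(index: dict[str, set[int]], query_keywords: list[str]) -> set[int]:
      # Union of matching articles; an intersection would be stricter.
      result = set()
      for kw in query_keywords:
          result |= index.get(kw.lower(), set())
      return result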
Is it possible to keep it up to date?
Table definition:
CREATE TABLE hackernews_history
(
update_time DateTime DEFAULT now(),
id UInt32,
deleted UInt8,
type Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
by LowCardinality(String),
time DateTime,
text String,
dead UInt8,
parent UInt32,
poll UInt32,
kids Array(UInt32),
url String,
score Int32,
title String,
parts Array(UInt32),
descendants Int32
)
ENGINE = ReplacingMergeTree(update_time) ORDER BY id;
A shell script:
BATCH_SIZE=1000
TWEAKS="--optimize_trivial_insert_select 0 --http_skip_not_found_url_for_globs 1 --http_make_head_request 0 --engine_url_skip_empty_files 1 --http_max_tries 10 --max_download_threads 1 --max_threads $BATCH_SIZE"
rm -f maxitem.json
wget --no-verbose https://hacker-news.firebaseio.com/v0/maxitem.json
clickhouse-local --query "
SELECT arrayStringConcat(groupArray(toString(number)), ',') FROM numbers(1, $(cat maxitem.json))
GROUP BY number DIV ${BATCH_SIZE} ORDER BY any(number) DESC" |
while read ITEMS
do
echo $ITEMS
clickhouse-client $TWEAKS --query "
INSERT INTO hackernews_history SELECT * FROM url('https://hacker-news.firebaseio.com/v0/item/{$ITEMS}.json')"
done
It takes a few hours to download the data and fill the table.
I saw the recursive CTE blog post, but this doesn't seem to work on your HN dataset:
https://play.clickhouse.com/play?user=play#V0lUSCBSRUNVUlNJV...
Are recursive CTEs disabled on this instance, or am I doing something wrong?
<Trace> ReadWriteBufferFromHTTP: Failed to make request to 'https://hacker-news.firebaseio.com/v0/item/40298680.json'. Error: Timeout: connect timed out: 216.239.32.107:443. Failed at try 3/10. Will retry with current backoff wait is 200/10000 ms.
I googled with no luck. I was wondering if you have a solution for it.
https://github.com/wilsonzlin/hackerverse/releases/tag/datas...
The comments text table is 13 GB, to give you an idea. Can definitely be processed on a laptop.
Feedback: on my iOS phone, once you select a dot on the map, there is no way to unselect it. The preview card of some articles takes the full screen, so I can't even click to another dot. Maybe add a "cross" icon to the preview card, or make it so that tapping outside the card hides the whole card strip?
Really neat work
edit: Also had no idea HN went back to 2006. https://news.ycombinator.com/item?id=1
edit2: PG wrote this? https://news.ycombinator.com/item?id=487171
Still an impressive number
Using it, I stumbled upon [1], which reflects your comments on comment sentiment.
This also reminded me of [2] (for which the site itself had rotted away, incidentally) - analysing HN users' similarity by writing style.
[1] https://minimaxir.com/2014/10/hn-comments-about-comments/ [2] https://news.ycombinator.com/item?id=33755016
Thanks for sharing that article, it was an interesting read. It was cool how deep the analysis went with a few simple statistical methods.
Any idea why password reuse is so far away from security? That was the only oddity of the map for me.
It would be cool to have analogous continents, countries, sub-regions, roads, different-sized settlements, and significant landmarks... This version looks great at the highest zoom level, but rapidly becomes hard to interpret as you zoom in, same as most similar large embedding or graph visualizations.
Once you register on iOS, you can also log in through the web app: https://hn.garglet.com
Probably not ready for a Hacker News hug of death yet, but you can try.
Possibly the greatest indicator of social startup success.
Compare topics/sentiment etc. by number of users and by number of posts.
Are some topics dominated by a few prolific posters? Positively or negatively.
Also, how does one separate negative/positive sentiment into criticism/advocacy?
How hard is it to detect positive criticism, or enthusiastic endorsement of an acknowledged bad thing?
What you've built is really impressive. I'm excited to see where this goes!
What I would like to figure out is the easiest way to go from the API straight into a parquet file.
As for the Arrow file, I'm not sure unfortunately. I imagine there are some difficulties because the format is columnar, so it probably wants a batch of rows (when writing) instead of one item at a time.
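One possible approach, sketched with pyarrow (the field list and batch size are arbitrary, and this isn't the OP's pipeline): buffer items into row batches and flush each batch to a ParquetWriter.

  # Stream HN items from the API into a Parquet file in row batches, since
  # Parquet/Arrow writers want chunks of rows rather than one item at a time.
  import pyarrow as pa
  import pyarrow.parquet as pq
  import requests

  schema = pa.schema([
      ("id", pa.int64()),
      ("type", pa.string()),
      ("by", pa.string()),
      ("time", pa.int64()),
      ("score", pa.int64()),
      ("title", pa.string()),
      ("url", pa.string()),
      ("text", pa.string()),
  ])

  def dump_items(start_id: int, end_id: int, path: str, batch_size: int = 1000):
      with pq.ParquetWriter(path, schema) as writer:
          rows = []
          for item_id in range(start_id, end_id):
              item = requests.get(
                  f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json").json()
              if item:
                  rows.append({name: item.get(name) for name in schema.names})
              if len(rows) >= batch_size:
                  writer.write_table(pa.Table.from_pylist(rows, schema=schema))
                  rows = []
          if rows:  # flush the final partial batch
              writer.write_table(pa.Table.from_pylist(rows, schema=schema))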
I trained a model to predict whether a given post will reach the front page, get flagged, etc. I collected over 1,000 RSS feeds and rank the RSS entries with my ranking models.
I submit the high-ranking entries to HN to test out my models, and I can reach the front page consistently, sometimes having multiple entries on the front page at a given time.
I also experiment with user->content recommendation, for that I use comment data for modeling interactions between users and entries, which seems to work fine.
The only problem I have is that I get a lot of 'out of distribution' content in my RSS feeds, which causes my ranking models to get 'confused'. For this, I trained models to predict whether a given entry belongs on HN or not. On top of that, I have some tagging models trained on data I scraped from lobste.rs and hand-annotated.
I've been working on this on and off for the last 2 years or so. This account is not my main, just one I created for testing.
AMA
i.e. do HN users upvote more based on the title of an article or on actually reading it?
Feature request: is it possible to show in the graph how popular the topic/sub-topic/article is?
So that we can do an educated exploration in the graph around what was upvoted and what was not?
Am I out of touch?
Gonna dig more into it.
Exemplary Show HN! We need more of this.
https://files.casmconsulting.co.uk/message-based-community-d...
If you would like to read about my largely unsuccessful recreation of the paper, you can do so here - https://dfworks.xyz/blog/partygate/
Turns out, there's only 1 post so far on his blog.
Hoping for more! This one is great.
I'm not in the ML/AI arena yet, so I couldn't fully understand the second half of the article except for having a general idea about Embeddings and their potential, but the first part is what interests me as a software engineer.
Following are some of the challenges the author came across. They were able to overcome each of them and published the full source code.
Downloading HN database
> There's also a maxitem.json API, which gives the largest ID. As of this writing, the max item ID is over 40 million. Even with a very nice and low 10 ms mean response time, this would take over 4 days to crawl, so we need some parallelism.
> I've exported the HN crawler [1] (in TypeScript) to its own project, if you're ever in need to fetch HN items.
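To illustrate the parallelism point (the author's real crawler, linked at [1], is TypeScript; this is just a hypothetical Python/aiohttp sketch):

  # Fetch many HN items concurrently, capping in-flight requests with a semaphore.
  import asyncio
  import aiohttp

  async def fetch_item(session, sem, item_id):
      url = f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json"
      async with sem, session.get(url) as resp:
          return await resp.json()

  async def crawl(start_id, end_id, concurrency=200):
      sem = asyncio.Semaphore(concurrency)
      async with aiohttp.ClientSession() as session:
          tasks = [fetch_item(session, sem, i) for i in range(start_id, end_id)]
          return await asyncio.gather(*tasks)

  # items = asyncio.run(crawl(40_000_000, 40_001_000))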
Fetching and parsing linked URLs' HTML for metadata and text
> For text posts and comments, the answer is simple. However, for the vast majority of link posts, this would mean crawling those pages being linked to. So I wrote up a quick Rust service [2] to fetch the URLs linked to and parse the HTML for metadata (title, picture, author, etc.) and text. This was CPU-intensive so an initial Node.js-based version was 10x slower and a Rust rewrite was worthwhile. Fortunately, other than that, it was mostly smooth and painless, likely because HN links are pretty good (responsive servers, non-pathological HTML, etc.).
Recovering missing/dead links
> A lot of content even on Hacker News suffers from the well-known link rot: around 200K resulted in a 404, DNS lookup failure, or connection timeout, which is a sizable "hole" in the dataset that would be nice to mend. Fortunately, the Internet Archive has an API that we can use to programmatically fetch archived copies of these pages. So, as a final push for a more "complete" dataset, I used the Wayback API to fetch the last few thousand articles, some dating back years, which was very annoying because IA has very, very low rate limits (around 5 per minute).
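One of the Wayback APIs usable for this is the availability endpoint; a rough sketch follows (the exact API the author used isn't specified, and `dead_links` is a stand-in for the failed URLs):

  # Look up the closest archived snapshot for each dead URL, throttled to respect
  # the Internet Archive's low rate limits mentioned above.
  import time
  import requests

  def wayback_url(dead_url: str):
      resp = requests.get("https://archive.org/wayback/available",
                          params={"url": dead_url}).json()
      snapshot = resp.get("archived_snapshots", {}).get("closest")
      return snapshot["url"] if snapshot and snapshot.get("available") else None

  dead_links = ["http://example.com/some-404-page"]  # placeholder for the ~200K failures
  recovered = {}
  for url in dead_links:
      archived = wayback_url(url)
      if archived:
          recovered[url] = archived
      time.sleep(12)  # roughly 5 requests per minute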
Finding a cost-effective cloud provider for GPUs
> Fortunately, I discovered RunPod, a provider of machines with GPUs that you can deploy your containers onto, at a cost far cheaper than major cloud providers. They also have more cost-effective GPUs like RTX 4090, while still running in datacenters with fast Internet connections. This made scaling up a price-accessible option to mitigate the inference time required.
This is the type of content that makes HN stand out from the crowd.
_____________________________
1. https://github.com/wilsonzlin/crawler-toolkit-hn/
2. https://github.com/wilsonzlin/hackerverse/tree/master/crawle...
A Peek inside HN: Analyzing ~40M stories and comments