LLM-based sentiment analysis of Hacker News posts between Jan 2020 and June 2023

(outerbounds.com)

126 points | mochomocha | 1y ago | 72 comments

tantalor 1y ago

Is this just using LLM to be cool? How does pure LLM with basic "In the scale between 0-10 ..." prompt stack up against traditional, battle-tested sentiment analysis tools?

Gemini suggests NLTK and spaCy

https://www.nltk.org/

https://spacy.io/

antihipocrat 1y ago

I'm wondering how their LLM parsing 250 mil words in 9 hours compares with performance of traditional sentiment analysis.

Also, many existing sentiment analysis tools have a lot of research behind them that can be referenced when interpreting the results (known confounds etc). I don't think there is yet an equivalent for the LLM approach.

Mockapapella 1y ago

Pretty slow. I built a sentiment analysis service (https://classysoftware.io/), and at 250M words and ~384 words per message I’m at about 5.6 hours to crunch all that data. Even then, I’m pretty sure there are ways to push it lower without sacrificing accuracy.
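To put the two figures quoted above side by side, here is a minimal back-of-the-envelope sketch. It uses only the numbers from this thread (250M words, ~384 words per message, 9 hours vs. 5.6 hours). The hardware behind either run is not stated, so treat the result as a rough ratio, not a benchmark.

```python
# Figures quoted in the thread; everything else is derived arithmetic.
TOTAL_WORDS = 250_000_000
WORDS_PER_MESSAGE = 384   # approximate, per the comment above
HOURS_CLASSIFIER = 5.6    # dedicated sentiment service
HOURS_LLM = 9.0           # LLM run described in the article


def words_per_second(total_words: int, hours: float) -> float:
    """Average throughput over the whole run."""
    return total_words / (hours * 3600)


messages = TOTAL_WORDS / WORDS_PER_MESSAGE
print(f"messages:   {messages:,.0f}")
print(f"classifier: {words_per_second(TOTAL_WORDS, HOURS_CLASSIFIER):,.0f} words/s")
print(f"LLM:        {words_per_second(TOTAL_WORDS, HOURS_LLM):,.0f} words/s")
print(f"ratio:      {HOURS_LLM / HOURS_CLASSIFIER:.2f}x")
```

On these numbers the dedicated service is roughly 1.6x faster, which is a smaller gap than "pretty slow" might suggest.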

throwawaymaths 1y ago

And yet, it's so much easier to deploy an LLM, either through a service or on prem.

tantalor 1y ago

It's easier to do a lot of things. That doesn't make it better.

JimDabell 1y ago

What do you mean? Deploying something like spaCy is far easier than deploying an LLM in my experience.

anonylizard 1y ago

Because LLMs WILL dominate all NLP use cases, whether you like it or not.

It's like Linux among operating systems. Sure, you can handwrite some custom OS more specialized for a purpose. But it's much easier to just use Linux, which everyone understands on a basic level and is extremely robust, and modify it slightly for the end goal.

And saying "Traditional sentiment analysis" tools are "Battle tested" is laughable. LLMs in the past year alone probably have 1000x the cumulative usage of all sentiment analysis tools in history.

LLMs get 100 billion + each year in research, improvements, engineering, optimisations.

LLMs keep rapidly improving year to year in capabilities. Sonnet 3.5 already obliterates the original GPT-4 in every aspect.

LLMs keep getting cheaper year to year. Gemini flash is like 100x cheaper than the original GPT3.5.

You can onboard any person who can write Python to start using LLMs for language analysis in a day, versus weeks to use these traditional tools.

Nearly all NLP tasks will be standardised to use LLMs as the baseline default tool. Sure there'll be some short term degradations in some specific aspect, but there's no stopping the tide.

By the way, traditional ML-based translation is also pretty much dead and replaced by LLMs. I've been seeing an explosion in fan-translations done by, say, Sonnet 3.5. The improvement in fluency and accuracy is radical; I often don't even notice the AI translation anymore.

daveguy 1y ago

Aside from a half dozen or so zeros, you're right on.

rsanek 1y ago

on what, the spending? Facebook alone said that it will spend $40b this year on AI. probably not all of it is on Llama but a sizable portion is.

RamblingCTO 1y ago

Sorry, but not really. If you know what you're doing, you don't just pick an LLM. LLMs are trained/built for a specific task: text generation. Other models are trained on different tasks. If you know what you're doing, you compare models (I don't mean LLM models with that!) and choose the best performing. Just because LLMs receive more training doesn't mean they perform better. Very weird and flawed way of thinking. This is just hype thinking.

hitradostava 1y ago

I have to agree with the parent. LLMs are excellent at a large range of NLP tasks. Of course they are not going to replace all ML models, but when it comes to NLP they are clearly better than lots of trained models (e.g. https://arxiv.org/pdf/2310.18025).

aussieguy1234 1y ago

I can see a similarity here in comparing Java/JavaScript/any other modern, more productive language to C. Yes, both can write more or less the same program, but you'll get the same result with less effort and more quickly with the modern languages. Yes, modern languages will always be slower and heavier on resources than C.

anonylizard 1y ago

That's inaccurate, because traditional sentiment analysis, or rather the entire NLP ecosystem, is a very niche and underoptimized space.

It's not comparing C against JavaScript, it's comparing Ada against JavaScript. Ada is not going to be any faster than JavaScript because it's too niche and therefore underoptimised.

The theoretical minimum computation required by LLMs is far higher than traditional simple NLP algorithms. But the practical computation cost of LLMs will soon be cheaper, because LLMs get so much investment and use that there are massive full-stack optimizations all the way from the GPU to the end libraries.

visarga 1y ago

I did a similar kind of process for my own chat logs. I have about 11M tokens worth of logs, and it took 2 days to crunch all of them with ollama and LLaMA 3.1 8B on my MacBook. It's slow, but free.

I generated title, summary, keywords and hierarchical topics up to 3 levels up from the original text. My plan for now is to put them in a vector search engine, which, incidentally, was made with Sonnet 3.5 with very little iteration. I want to play around to see how I can organize my ideas with LLMs, make something useful from all that text.

I really don't know what I will discover. One small insight I already found is that summarization works really well, you can use summaries instead of full texts to prime Claude and it works better than expected. Unlimited context? Maybe.

Another direction of research is to create a nice taxonomy, there are thousands of topics, pretty difficult task, but there must be a way using clustering and LLMs. That is why I generated topic, parent-topic, gp-topic, and ggp-topic from all snippets. I would probably manually edit the top 2 levels of the taxonomy to give it the right focus.

I'm also integrating with my HN and reddit feeds. X is too stingy with the API. Maybe Pocket and local downloads folder too, I save/bookmark stuff I like. I could also include all the papers I am reading into the corpus. It could synthesize a ranked feed aligned to my own interests.

ma9o 1y ago

I'm working on something tangentially related [1] but by sourcing my Google search history data. It's surprising how LLaMA 3.1 8B is pulling most of the weight in my case too.

[1] https://github.com/enclaveid/enclaveid

mithametacs 1y ago

LLMs are shit at generating content, but summarization works really well.

I’d like to use your project

huem0n 1y ago

> NFL (915 posts)

> Football (206 posts)

Either hacker news really likes the national forensic league, or these LLM-categories are a bit dubious.

Also hmmm:

> American football (7 posts)

> American_football (6 posts)
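The near-duplicate tags quoted above can be collapsed with a small normalization pass before counting. The sketch below is illustrative only and is not taken from the article's code; `normalize_tag` and `merge_counts` are hypothetical helpers. The counts are the ones quoted in this comment.

```python
from collections import Counter


def normalize_tag(tag: str) -> str:
    """Treat underscores and hyphens as spaces, collapse whitespace, ignore case."""
    return " ".join(tag.replace("_", " ").replace("-", " ").split()).lower()


def merge_counts(counts: dict[str, int]) -> Counter:
    """Sum post counts for tags that normalize to the same string."""
    merged: Counter = Counter()
    for tag, n in counts.items():
        merged[normalize_tag(tag)] += n
    return merged


raw = {"NFL": 915, "Football": 206, "American football": 7, "American_football": 6}
merged = merge_counts(raw)
print(merged.most_common())
```

This only fixes spelling variants. Merging genuine synonyms such as "NFL" and "American football" would need an explicit mapping or an embedding-based clustering step.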

winddude 1y ago

It's this one: "these LLM-categories are a bit dubious". Specialized models still outperform LLMs on niche tasks like classification and sentiment.

EarthLaunch 1y ago

> Tokens Don't Lie

> But how do people feel about these topics

I find it notable that tokens don't necessarily express people's feelings. Put another way, tokens aren't how people feel, they're how they write.

Samstave mentioned in this thread that Twitter is a 'global sentiment engine'. I'm sure that's literally true. Sentiment measurement is only accurate to the degree that people are expressing their real feelings via tokens. I can imagine various psychological and political reasons for a discrepancy.

If you did sentiment analysis of publicly known writings of North Korean administrators, would that represent their feelings?

I think the interplay with free speech is interesting here: In a setting where people feel socially and legally safe to express their true opinion, sentiment analysis will be more accurate.

adsharma 1y ago

Can you run this tool on the removed posts dataset?

https://github.com/vitoplantamura/HackerNewsRemovals

jmward01 1y ago

I wonder if the dip is more about Llama 3 70B training and data than a change in sentiment. The data cut-off was Dec 2023 for 70B. That looks to coincide with the reversal of the dip.

vtuulos 1y ago

That's an interesting hypothesis but the words we use to express agreement and disagreement haven't changed much.

We don't try to retrieve articles/topics from the model, which would be affected by the cutoff; we just ask it to analyze the sentiment or summarize the content provided in a prompt.

jmward01 1y ago

True. It would be interesting to run these same tests on the 7B model to see if trend information changes or not. 7B had a March cutoff, so if the Aug-Dec dip migrated to Oct-March (or just disappeared) it would be strong evidence for training/data bias. If nothing else, comparing 7B to 70B would likely be interesting.

Edit: I realized too late I had the years off. It is pure coincidence of month, not a real data bias. Sorry! I still think it would be interesting to see a 7B comparison, but that is just to see how well a small model could spot big trends compared to a bigger one.

vtuulos 1y ago

yep! And of course the new 3.1 model

samstave 1y ago

> Use the tool below to explore various topics and the sentiments they evoke.

This is a cool phrase.

It is personally important, as when I was asked in a panel interview @ -- they asked "What do you think Twitter is?"

My response was "You're a global sentiment engine"

(There are a lot of conversations I'd love to have with the HN community with respect to our shared experiences, and weird history flipped-bits that exist in the minds of those who experienced that...

like threads of how Linux came, or how XML was born through things I touched in a Forrest Gump way - and how there are so many stories from so many.)

MBCook 1y ago

Speaking of Twitter, it would be very neat to be able to see a graph of sentiment over time if you select a term.

You could watch Twitter go from being a niche little new thing to popular to "twitter is trash" to too popular to increasingly divisive to the purchase and rename to X to today.

throwup238 1y ago

> My response was "You're a global sentiment engine"

More like a sentiment engine for bot operators.

SubiculumCode 1y ago

I wanted to do an analysis of Hacker News on another topic, but over a longer timespan.

I started to look into it, but in the little time I had to devote to the idea, I read that the Algolia API lets you look over a longer period, but that it is relatively costly.

I just want to look for all story titles from the beginning of time which match one of several simple search terms, and return submission date and title for an analysis I'd conduct in R.

Am I overthinking it and a simple Python script without an API code can do it?

vtuulos 1y ago

even simpler, you can just do it in SQL

You can find all titles and dates since the beginning of HN in this public BigQuery dataset: https://console.cloud.google.com/marketplace/product/y-combi...
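A minimal sketch of what such a query could look like, built from Python so the search terms stay in one place. The table name `bigquery-public-data.hacker_news.full` and the columns `title`, `timestamp`, and `type` are assumptions about the public dataset's schema and should be checked against the dataset page linked above. `build_title_query` is a hypothetical helper.

```python
import re

# Assumed table and column names; verify them against the dataset schema before use.
TABLE = "bigquery-public-data.hacker_news.full"


def build_title_query(terms: list[str]) -> str:
    """Return SQL selecting submission date and title for stories matching any term."""
    for term in terms:
        # Only allow simple terms so the string interpolation below stays safe.
        if not re.fullmatch(r"[A-Za-z0-9 ]+", term):
            raise ValueError(f"unsupported search term: {term!r}")
    conditions = " OR ".join(f"LOWER(title) LIKE '%{t.lower()}%'" for t in terms)
    return (
        "SELECT DATE(timestamp) AS submitted, title\n"
        f"FROM `{TABLE}`\n"
        f"WHERE type = 'story' AND ({conditions})\n"
        "ORDER BY submitted"
    )


sql = build_title_query(["rust", "golang"])
print(sql)
```

The generated SQL can be pasted into the BigQuery console and the result exported as CSV for analysis in R.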

SubiculumCode 1y ago

whoah. thank you dude!!

lz400 1y ago

It's funny filtering by crypto and seeing the (sometimes hazy) division between cryptography (we love this) and cryptocurrency (we hate it) terms.

chazeon 1y ago

I wonder if using prompts to get the sentiment in LLM is enough? So we do not need to do any fine-tuning anymore?

t-writescode 1y ago

I think you raise a reasonable question.

I also think it *could* be less of a problem than you might think. If we treat the scale as arbitrary (which I think is a safe thing to do), then movement along the scale could be sufficient to ascertain *something*

synicalx 1y ago

> Hate : Torture

Great work folks, glad we can all agree on that one.

Interesting that they used an LLM for this. I mean it makes sense and the data seems to pass the pub test but I, in my ignorance, would not have assumed that a language model would be well suited for number crunching.

silisili 1y ago

Seems we mostly agree on hating Atlassian, too, so it's working as intended.

Hamuko 1y ago

Conveniently sandwiched between War on Terror and CSAM.

Sleaker 1y ago

Why is everything only plotted between 4 and 8 if the least liked topic should be at 0 and the most liked at 9? Also, 4.5 is the midpoint, but 4 is displayed as bright red and 6 as a muted gray-blue. Why? This makes no sense except to be psychologically disingenuous.

And no 5s? What is even going on in that LLM?

sebastiennight 1y ago

> "It's a scale of 1 to 13, but it goes up and back down. Eight is the highest score on the scale." - Jason Mendoza

It's nice to see this scale used outside of The Good Place.

teo_zero 1y ago

The scale makes no sense.

Sentiment of forum posts is not an absolute value; you can't compare it against, for example, conversations in a pub, or talks between friends, etc.

I think they should have normalized the numbers around the average, so as to have a relative measurement of the various topics.
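One way to read "normalized around the average" is a z-score per topic: subtract the mean topic score and divide by the standard deviation. The sketch below assumes that interpretation. The topic names and scores are made up for illustration and are not values from the article.

```python
from statistics import mean, pstdev


def relative_scores(scores: dict[str, float]) -> dict[str, float]:
    """Express each topic's score in standard deviations from the mean topic score."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        # All topics scored the same, so none stands out.
        return {topic: 0.0 for topic in scores}
    return {topic: (score - mu) / sigma for topic, score in scores.items()}


# Hypothetical per-topic average sentiment on the article's raw scale.
example = {"topic_a": 4.0, "topic_b": 6.0, "topic_c": 8.0}
z = relative_scores(example)
print(z)
```

With this transform a topic at the site-wide average maps to 0, and the sign shows whether a topic is liked more or less than typical for this forum.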

Mathnerd314 1y ago

> Reply only the tags

LLMs are really sensitive to bad or even slightly ambiguous grammar. I wonder if the numbers would differ significantly with "Reply only with the tags, in the following format".

vtuulos 1y ago

I had the same concern. However, the structure of the output was surprisingly stable. We rejected badly formatted responses: https://github.com/outerbounds/hacker-news-sentiment/blob/ma...

The semantics of the topics/tags could be improved for sure with a more detailed prompt
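As an illustration of rejecting badly formatted responses, here is a minimal sketch. It is not the code behind the truncated link above. It assumes a response line of the form `SENTIMENT <number>`, as in the `SENTIMENT 6` example posted elsewhere in this thread, and a 0-10 scale as mentioned in the top comment. `parse_sentiment` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumed output format: a line consisting of "SENTIMENT" followed by an integer.
SENTIMENT_RE = re.compile(r"^SENTIMENT\s+(\d{1,2})$")


def parse_sentiment(response: str) -> Optional[int]:
    """Return the score from the first well-formed line, or None to reject the response."""
    for line in response.splitlines():
        match = SENTIMENT_RE.match(line.strip())
        if match:
            score = int(match.group(1))
            # Reject numbers outside the assumed 0-10 scale.
            return score if 0 <= score <= 10 else None
    return None


print(parse_sentiment("SENTIMENT 6"))
print(parse_sentiment("I'd say this is about a 6 out of 10."))
```

Returning `None` instead of guessing keeps malformed outputs out of the aggregate, at the cost of dropping some posts.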

anonu 1y ago

At least Republicans and Democrats share the same low sentiment score of 4.

qwerpy 1y ago

Apparently your comment is divisive though!

savin-goyal 1y ago

what's up with the title flips from

> 350M Tokens Don't Lie: Love And Hate In Hacker News, to

> LLM-based sentiment analysis of Hacker News posts, to

> LLM-based sentiment analysis of Hacker News posts between Jan 2020 and June 2023

t-writescode 1y ago

A/B testing? Possibly increasing accuracy from high click-bait, low signal to low click-bait high signal?

bravura 1y ago

Can we get a 2-d visualization of topics, and drill into topics?

vtuulos 1y ago

yes please! The data is conveniently available as JSON blobs here https://github.com/outerbounds/hacker-news-sentiment/tree/ma...

elashri 1y ago

> It is worth clarifying though that Hacker News does not hate International Students, but the posts related to them tend to be overwhelmingly negative, reflecting the community’s sympathy for the challenges faced by those studying abroad.

I was horrified when I read international students as one of the top entries on the hate list. Although I had seen a couple of comments attributing their cities' housing crises to international students, and thought that this sentiment was widely supported.

vtuulos 1y ago

here's how the model ranks the discussion on this page after 40 comments:

SENTIMENT 6

anonu 1y ago

Great analysis. How is divisiveness actually calculated?

ysofunny 1y ago

the most divisive topic seems to be "gnome" with 0.82 on the divisiveness scale

that's really "hacker", a worthy first place

anonu 1y ago

more like h4x0r

vtuulos 1y ago

search "divisive" here: https://github.com/outerbounds/hacker-news-sentiment/blob/ma...

I actually spent 10 minutes trying to see if there are obvious tests for U-shaped distributions. I'd love to hear if anyone has ideas here.
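One simple candidate is Sarle's bimodality coefficient, which combines skewness and kurtosis. Values above 5/9 (the value for a uniform distribution) hint at a bimodal or U-shaped distribution. The sketch below uses plain population moments without the small-sample correction, so it is a simplification and not the method used in the article. Hartigan's dip test is a more formal alternative.

```python
def bimodality_coefficient(values: list[float]) -> float:
    """Simplified Sarle coefficient: (skewness^2 + 1) / kurtosis, from population moments."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    if m2 == 0:
        # All values identical: no spread, so report no bimodality.
        return 0.0
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis, 3 for a normal distribution
    return (skewness ** 2 + 1) / kurtosis


THRESHOLD = 5 / 9  # value for a uniform distribution

polarized = [0, 0, 0, 10, 10, 10]          # scores piled up at both ends
consensus = [5, 5, 5, 5, 4, 6, 5, 5, 3, 7]  # scores clustered in the middle

print(bimodality_coefficient(polarized) > THRESHOLD)
print(bimodality_coefficient(consensus) > THRESHOLD)
```

The coefficient is cheap to compute per topic, but it can be fooled by heavily skewed unimodal data, so it works best as a first-pass filter.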

thr0w 1y ago

I don't know about this analysis and its conclusions. I'll just use this as a jumping-off point to selfishly spout my own human observations.

For context, I'm someone who uses HN to search for topics I'm interested in, rather than something like Google or Reddit.

- For anything SF community-related, most hits are from 10+ years ago. Lots of "hey we have a space in soma, any local startups want to hang and drink beers?" or "we have an empty desk in a space in the mission, any hackers want to grab it for free?" - all from around 2012 or prior. Nothing like that seems to happen anymore.

- Starting from around 2016, a heavy anti-technology sentiment appears. Cloud, crypto, AI - all are nonsense propagated by VC types and overzealous engineers.

- Similarly, any thread involving money/labor invariably has an anti-capitalist and/or "unions would solve everything" tangent.

Would be interested to hear if others have observed similar.

Karrot_Kream 1y ago

Yeah that’s roughly been my read too. I think the audience of the site changed. The user base has grown significantly. The site has gone from being about hacking (“hey here’s an empty desk”) to the culture of hackers at large (“tech was a mistake when it got invaded by VC hucksters”.)

TFA’s sentiment decrease tracks very closely with the huge uptick in user creation that started in 2022. HN isn’t really a tech site anymore, it’s about vibes. That makes sense given that in 2024 there’s a million places online talking about tech so HN only has its culture to distinguish itself. This wasn’t the case in 2008. The vibes here, along with the older demographics of the site, are increasingly nostalgic and cynical.

It'll all probably go the same way as Slashdot did, which went through the same cycle (replace "VC huckster" with "Microsoft" and "surveillance capitalism" with "three letter agencies"), until it too gets replaced by a site/community with energetic younger users creating new things.

teleforce 1y ago

Systemd is now in the Love section; that's HN news in itself.
