HN is in the same cluster as 2ch, not Techcrunch, on Twitter

(hella.cheap)

197 points · rabidsnail · 10y ago · 46 comments

bhouston10y ago

2D projections of complex multidimensional data are extremely unreliable as to what adjacency means. Most adjacencies are an artifact of the chosen projection method.

daniel-levin10y ago

This comment got me thinking: in some applications, Euclidean distance between feature vectors acts as a good proxy for adjacency/similarity. For such applications, an isometry from R^n to R^2 or R^3 should in principle preserve the meaning of adjacency. A quick Google yields techniques for quasi-isometric and isometric dimensionality reduction [0, 1]. This should mitigate artefacts of adjacency, or non-adjacency, as it were. In other words, you might be able to actually pull off good 2D projections of high-dimensional data and still see meaningful relationships.

[0] https://en.wikipedia.org/wiki/Isomap

[1] https://www.aaai.org/Papers/AAAI/2007/AAAI07-083.pdf
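One way to check whether a given 2D layout actually preserved adjacency is to compare each point's k nearest neighbours before and after the projection. The following is a minimal Python sketch of that check, not Isomap itself; the function names and the toy data are made up for illustration. Data that really lies in a 2D plane survives the projection intact, while the corners of a 4D hypercube do not.

```python
import itertools
import math

def nearest(points, i, k):
    """Indices of the k points closest to points[i] (ties broken by index)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def neighbor_preservation(high, low, k):
    """Mean fraction of each point's k nearest neighbours kept by the layout."""
    kept = [
        len(nearest(high, i, k) & nearest(low, i, k)) / k
        for i in range(len(high))
    ]
    return sum(kept) / len(kept)

# Toy data (hypothetical): 4D points that all sit in the plane of the
# first two axes, so dropping the last two coordinates is an isometry.
flat_high = [(0, 0, 0, 0), (1, 0, 0, 0), (3, 2, 0, 0),
             (6, 2, 0, 0), (10, 5, 0, 0), (15, 9, 0, 0)]
flat_low = [p[:2] for p in flat_high]

# Corners of a 4D hypercube: genuinely 4-dimensional, so the same
# projection collapses distinct corners onto each other.
cube_high = list(itertools.product((0, 1), repeat=4))
cube_low = [p[:2] for p in cube_high]

flat_score = neighbor_preservation(flat_high, flat_low, k=2)
cube_score = neighbor_preservation(cube_high, cube_low, k=3)
print(flat_score, cube_score)
```

A score of 1.0 means every neighbourhood survived; the lower the score, the more of the apparent adjacency in the plot is an artifact of the projection.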

ecesena10y ago

Sammon mapping is another famous example, see [1] for instance for a nice visualization.

[1] http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV09...

frozenport10y ago

>> Provides us with a measure of the quality of any given transformed dataset. However, we still need to determine the optimal such dataset, in terms of minimising E. Strictly speaking, this is an implementation detail and the Sammon mapping itself is simply defined as the optimal transformation;

Somehow it's technically challenging to verify the content of this article.
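For reference, the E in the quoted passage is Sammon's stress: the squared differences between original and projected pairwise distances, each weighted by the inverse of the original distance and normalised by the sum of original distances. Below is a rough Python sketch of computing it for a given layout; the function name and sample points are illustrative only, and it only evaluates E rather than minimising it.

```python
import math

def sammon_stress(high, low):
    """Sammon's stress E between original points and their low-dim layout.

    E = (1 / sum d*_ij) * sum (d*_ij - d_ij)^2 / d*_ij  over pairs i < j,
    where d* is the original distance and d the projected distance.
    """
    weighted_error = 0.0
    total_original = 0.0
    for i in range(len(high)):
        for j in range(i + 1, len(high)):
            d_star = math.dist(high[i], high[j])
            d = math.dist(low[i], low[j])
            if d_star == 0:
                raise ValueError("duplicate input points are not supported")
            weighted_error += (d_star - d) ** 2 / d_star
            total_original += d_star
    return weighted_error / total_original

# Hypothetical sample: three 3D points and two candidate 2D layouts.
original = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 0.0, 0.0)]
perfect = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]    # distances unchanged
collapsed = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]  # everything on one spot

print(sammon_stress(original, perfect))
print(sammon_stress(original, collapsed))
```

A layout that keeps every pairwise distance has zero stress, and collapsing all points onto one spot gives a stress of 1, which is a handy sanity check for any optimiser built on top.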


rabidsnailOP10y ago

For small distances, yes. If you plot a 2D projection of a dataset that doesn't have much structure, you're going to be reading patterns into white noise (though this data has some pretty clear clusters, which are probably real). If I were doing something other than writing a fun blog post I would have done cluster analysis with something like DBSCAN.
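For anyone who hasn't met DBSCAN: it grows clusters out from "core" points that have at least min_pts neighbours within radius eps, and marks everything it can't reach as noise. Here is a small brute-force Python sketch of the idea on made-up points. It is not what the post did, and a real analysis would more likely use a library implementation with a spatial index.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN; returns one label per point (NOISE for outliers)."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # A point counts as its own neighbour, as in the usual definition.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point, keep growing
                queue.extend(reachable)
        cluster += 1
    return labels

# Hypothetical 2D data: two tight groups and one stray point.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(sample, eps=1.5, min_pts=3)
print(labels)
```

The point of running this on the original high-dimensional vectors rather than on the 2D plot is that the clusters then don't depend on the projection at all.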

rryan10y ago

Also, this is t-SNE: https://en.wikipedia.org/wiki/T-distributed_stochastic_neigh...

The S is for "stochastic" -- i.e. you get a different 2D projection every time you run it on the same inputs. Take it with a grain of salt.

thisisdave10y ago

>The S is for "stochastic" -- i.e. you get a different 2D projection every time you run it on the same inputs.

That's not the part that's "stochastic"; sensitivity to initial conditions is just nonconvex optimization in action. You get the same thing with most other local embeddings.

The stochastic bit is that the model is based on optimizing "the asymmetric probability, p_ij, that i would pick j as its neighbor"[0]. Those probabilities and the associated positions in 2D space are not estimated stochastically (e.g. with Monte Carlo sampling) or anything, though.

[0] https://www.cs.nyu.edu/~roweis/papers/sne_final.pdf
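To make that concrete, here is a rough Python sketch of those neighbour probabilities as defined in the SNE paper: a Gaussian kernel on squared distance, normalised per row. It assumes one fixed sigma for every point, whereas the paper picks a per-point sigma from a target perplexity; the function name and sample points are made up.

```python
import math

def neighbor_probabilities(points, sigma=1.0):
    """P[i][j] = probability that point i would pick point j as its neighbour.

    Uses a single fixed sigma (a simplifying assumption). Very distant
    points can underflow to a zero row sum; this sketch does not guard
    against that.
    """
    n = len(points)
    P = []
    for i in range(n):
        weights = [
            0.0 if j == i
            else math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P

# Hypothetical 1D sample: point 1 sits between a near and a far point.
sample = [(0.0,), (1.0,), (3.0,)]
P = neighbor_probabilities(sample)

print(P[0][1], P[1][0])  # the two directions differ
```

Nothing here is sampled: the same inputs always give the same matrix, and P[0][1] differs from P[1][0] because each row is normalised separately. The run-to-run variation people see comes from the random starting layout of the 2D optimisation.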

personjerry10y ago

I wonder if I could post a randomly generated graph, label it with HN-interested labels arbitrarily, and get a serious talk started on HN about nonexistent correlations.

hapless10y ago

TechCrunch reports on us. It is journalism for the spectators. The twitter cluster of people sharing TC links is TC's audience, not participants in TC's subject matter.

Why in blue hell would anyone on HN be sharing TC links? Intuitively it seems more likely that people who share HN links are discussing these matters directly.

bitbckt10y ago

Interesting parallel observation: when I worked for a regional newspaper some years ago, we rolled out products for the same demo as "mommy blog Twitter". We saw the same sort of isolated behavior - visitors to "mommy blog content" almost never strayed onto our mainstream products.

The same sorts of products delivered to "puppy and kitty" people didn't have the same effect, though the level of vitriol in the comments was similar.

madaxe_again10y ago

Ditto. Launched (well, we built - client project) a social network for moms nearly a decade ago, and they were Not Interested in anything outside of the core offering - even recipes, which you would have thought would be interesting, weren't - until they rebranded along the lines of "recipes for moms", which changed that interaction overnight.

Some demographics choose tighter filter bubbles for themselves than others, and moms are likely up there, as the single most important thing to mothers tends to be being a mother - it becomes an all-encompassing identity for many.

hkmurakami10y ago

Considering nicovideo is anti-establishment media (it's owned by Kadokawa, which is an underdog media company with strong subculture roots) and that 2chan "summary sites" double as news sources for the anti-establishment these days, the association seems apt.

newobj10y ago

This is amazing, one of my favorite articles on HN ever.

I'm really curious what the heck that "eye" is in the bottom right space of the clusters. Some cluster so radically orthogonal to any other content it has an order of magnitude more distance in differentiation?

rabidsnailOP10y ago

(original author here) it's a spambot network. If you click the link in that post to the interactive version (this: https://pile-of-junk.s3.amazonaws.com/twitter_scatter_10k.ht...) you can see for yourself.

stephenboyd10y ago

This is cool. How many sampled tweets did HN links appear in? How many sampled tweets did you have overall?

I'm curious if a sampling error could explain why an English website like HN would get placed with the Japanese language sites. StackOverflow isn't placed by any related sites either.

If the weird results aren't from sampling artifacts, my best guess is that a lot of spambots must be linking to multiple legit sites regardless of relevance.

brownbat10y ago

I really hope someday we get spambots that start off by trying to make useful contributions. Then later, after building a following, start advertising scams.

I'm confident that, given the right incentives, spam kings could discover conversational AI before any lab.

swerling10y ago

This is fantastic. Feature request: drag a rectangle over a group of dots, and see them as a text list of websites. As is it's hard to see all the sites that are in a dense dot cluster.

TazeTSchnitzel10y ago

Quran quotes being grouped with archive.org might be explained by the Internet Archive frequently being used to host Islamist materials.

runn1ng10y ago

Just today I wondered why so few journalists are picking up on the fact that ISIS is using archive.org almost exclusively for uploading their beheading and other PR videos.

i336_10y ago

The interactive version is powered by this dataset - http://pile-of-junk.s3.amazonaws.com/domain_similarity_tsne_... - processed by JavaScript inside the page: https://pile-of-junk.s3.amazonaws.com/twitter_scatter_10k.ht...

wodenokoto10y ago

> Japanese social media twitter (which I'm labelling as "2ch", though it's not just 2ch) is almost completely distinct from what I'm calling "upstanding japanese twitter" (links to mainstream news sites like news24)

I have no idea what the point of the headline is after reading the above part of the post.

Ezhik10y ago

That's interesting. Never would've made the connection myself, although now that I think about it, some of the most fascinating discussions I've read on HN involved Japanese work culture.

ChuckMcM10y ago

This is some fascinating analysis. And like the Author I am amazed that Twitter doesn't crack down harder on their spambots.

n0us10y ago

I've wondered that as well. I'm not "active" on Twitter but I log on occasionally to see if there are any interesting tweets in my feed. Every time I log on I have a new follower from penny stocks twitter, get rich quick schemes, and various other fake profiles. This seems to stay stable at around 20 fake followers as old ones get erased and new ones follow.

It seems like amateurs are more capable of detecting spam than the entire company, but I sometimes wonder if they just know about it and leave the spam bots because, once they crack down, new ones will just pop up. Or if they keep them around at a tolerable level that doesn't drive real users away but still allows them to publish a higher "user count".

egypturnash10y ago

This may also be due in part to more active users of Twitter hitting the "report spam" button on those spam bots. If a spambot tweets at me, I'll go do that. I'm sure I'm not the only one, as I never see a spambot with more than a handful of tweets showing up in my mentions.

So, crowdsource spam detection.

username22310y ago

> Or if they keep them around at a tolerable level that doesn't drive real users away but still allows them to publish a higher "user count"

They seem to have figured out that 20 fake accounts is not enough to get you to leave their service.

matheweis10y ago

twitter should hire op, this is some incredible analysis - and I don't think he counts as an amateur.

Also, they are apparently too busy battling isis (http://www.theguardian.com/technology/2016/feb/05/twitter-de...) to deal with the spam issue effectively.

jonesb610y ago

Well it's whack-a-mole isn't it? Take down one spam network and another crops up with an entirely different methodology and signature. If I was managing a large social network that suffered from bots I would whack until I came across an opponent that did the least possible damage, then weaken it through things like shadow bans etc to the point where it won't die but will operate with the bare minimum amount of damage to the network.

jerrickhoang10y ago

I think a more interesting problem is how you can differentiate a spambot from a 'non-spam' bot. I've seen some bots that are really creative and fun on Twitter. I guess it's not really hard to add that to a spam detection ML model.

rabidsnailOP10y ago

Non-spam bots generally don't follow each other or link to external websites. (I'm also the author of one of the more popular image bots https://twitter.com/a_quilt_bot)

surfmike10y ago

what is 2ch?

daodedickinson10y ago

Japanese predecessor of 4chan.

yawawort10y ago

What you're thinking of is Futaba (www.2chan.net). 2ch is text only and would be closer to Reddit than 4chan (at least culturally).

Rayearth10y ago

So HN is close to nico (Japanese youtube) and pixiv (Japanese-centric art and fanart site)? Interesting.

forrestthewoods10y ago

What are all of the other twitters? There is so much undocumented space! I want to know what it all is!

simcop238710y ago

Is the regex search in the demo not working for anyone else? (Tested both Chrome and Firefox on Win7.)

rabidsnailOP10y ago

There's no UI for if there are no matches; it just does nothing. Try searching for \.com or something.

Edit: I patched it so it displays an alert if there are no matches.
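The behaviour after the patch amounts to something like the sketch below. This is Python for illustration, not the demo's actual JavaScript; the function name, messages, and sample domain list are hypothetical.

```python
import re

def search_domains(pattern, domains):
    """Return (matching domains, status message) for a regex search."""
    try:
        regex = re.compile(pattern)
    except re.error as exc:
        # A malformed pattern should be reported, not silently ignored.
        return [], f"invalid regex: {exc}"
    matches = [d for d in domains if regex.search(d)]
    message = f"{len(matches)} match(es)" if matches else "no matches"
    return matches, message

# Hypothetical sample of domains, not the real dataset.
sample_domains = ["news.ycombinator.com", "techcrunch.com",
                  "archive.org", "hella.cheap"]

print(search_domains(r"\.com", sample_domains))
print(search_domains(r"reddit", sample_domains))
```

Returning a message for both the empty-result and the bad-pattern case means the user can tell "nothing matched" apart from "search is broken".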

simcop238710y ago

I see. That patch makes it a lot nicer to find out that none of the sites I wanted to look for show up in the data :)

kitwalker1210y ago

(Update) see rabidsnail's suggestion

not working for me on Chrome or Safari either

gohrt10y ago

why does the hella.cheap site have an SSL cert with an unknown authority?

tokenizerrr10y ago

It has a COMODO certificate. If you see otherwise you might be getting MITMd.

schoen10y ago

It has a valid Comodo certificate but forgot to include the full certificate chain, which is probably now the #1 configuration error (I help do support for Let's Encrypt and about 80% of "my cert doesn't work after issuance" problems are that). These bugs are tricky because most browsers cache intermediate certs and then forgive sites that don't send intermediates that the browser knows about, so you can see an error in one browser or device and not another because of different cert caches!
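A cheap pre-deployment check is to count the certificates in the PEM file the server is configured to send: a CA-issued leaf with no intermediates after it is the usual symptom. Below is a minimal Python sketch under that assumption. The function names are made up, the sample bundle uses placeholder bodies rather than real certificates, and nothing here validates signatures or the order of the chain.

```python
PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"

def count_pem_certificates(pem_text):
    """Count complete BEGIN/END CERTIFICATE blocks in a PEM bundle."""
    count = 0
    inside = False
    for raw_line in pem_text.splitlines():
        line = raw_line.strip()
        if line == PEM_BEGIN and not inside:
            inside = True
        elif line == PEM_END and inside:
            inside = False
            count += 1
    return count

def looks_like_full_chain(pem_text):
    """Heuristic only: a leaf plus at least one intermediate is present."""
    return count_pem_certificates(pem_text) >= 2

# Placeholder bundles (not real certificates).
leaf_only = f"{PEM_BEGIN}\nAAAA\n{PEM_END}\n"
leaf_plus_intermediate = leaf_only + f"{PEM_BEGIN}\nBBBB\n{PEM_END}\n"

print(looks_like_full_chain(leaf_only))
print(looks_like_full_chain(leaf_plus_intermediate))
```

Because browsers with a warm intermediate cache hide the problem, checking the file itself (or testing from a fresh client) is more reliable than loading the site in your everyday browser.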

kalleboo10y ago

I just ran into this today... A site I manage with a Comodo certificate was showing unknown issuer in Firefox and only Firefox, and I've never had it fail before (and we've never had any user reports). Added in the cert chain, error is gone. Dunno if the other browsers had Comodo as trusted or it's common enough that everyone who regularly uses Firefox (I haven't used it in months) has it cached...

kazazes10y ago

Wouldn't it be more reasonable for browsers to not cache them at all and universally reject missing intermediate certificates? (IIRC, Chrome doesn't mind but Firefox will give you the train conductor.)

