> For quality control, I looked only at comments with Reddit score > 100
That's a non-trivial popularity score. Also, since it's an absolute threshold, it will bias against smaller subreddits, where reaching 100 points on any comment is difficult.
This is much less "how people talk on reddit", and much more "the type of comment that gets upvotes on the default subreddits".
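To make the threshold bias concrete, here is a rough sketch comparing an absolute cutoff with a per-subreddit relative cutoff. The subreddit names and scores are made up for illustration, and `relative_filter` is only one hypothetical alternative, not something the original analysis used.

```python
from collections import defaultdict

# Hypothetical (subreddit, score) pairs, invented purely for illustration.
COMMENTS = [
    ("big_default", 2500), ("big_default", 340),
    ("big_default", 120), ("big_default", 15),
    ("mid_sized", 180), ("mid_sized", 60), ("mid_sized", 9),
    ("small_niche", 40), ("small_niche", 12), ("small_niche", 3),
]


def absolute_filter(comments, min_score=100):
    """Keep comments whose score exceeds a fixed, site-wide threshold."""
    return [(sub, score) for sub, score in comments if score > min_score]


def relative_filter(comments, top_fraction=0.25):
    """Keep the top fraction of comments within each subreddit (at least one)."""
    by_sub = defaultdict(list)
    for sub, score in comments:
        by_sub[sub].append(score)
    kept = []
    for sub, scores in by_sub.items():
        ranked = sorted(scores, reverse=True)
        k = max(1, int(len(ranked) * top_fraction))
        kept.extend((sub, score) for score in ranked[:k])
    return kept


absolute_subs = {sub for sub, _ in absolute_filter(COMMENTS)}
relative_subs = {sub for sub, _ in relative_filter(COMMENTS)}
```

With this toy data the absolute cutoff drops the smallest community entirely, while the relative cutoff keeps at least one comment from each.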
Imputation. [1] Remove a word from a sentence, then try to predict it from its surrounding context: "when I get home tonight, i vape a ___ then space out". Assign predicted probabilities to the candidate meanings of the imputed word ":leaf emoji:": ["marijuana cigarette", "electronic cigarette", "cigar"].
Active learning. Seed the algorithm with expert knowledge from law enforcement, drug users, and social workers, who know of the encryption keys.
Anomaly detection. Though perhaps easily confused with other, innocuous usage, street slang is a distinct form of language with its own properties and patterns. Compared to common discourse, it looks strange and random, and that deviation could be measured.
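The imputation idea can be sketched with nothing more than context-word counts and a naive Bayes score. This is a toy under loud assumptions: the corpus is invented, the candidates are single tokens ("joint", "cartridge", "cigar") rather than the multi-word labels above, and `train` and `impute` are hypothetical helper names. A serious attempt would use a language model trained on something like [1].

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real study would use a large comment dump.
CORPUS = [
    "i smoke a joint then space out",
    "i smoke a joint after work",
    "i smoke a cigar on the porch",
    "i vape a cartridge then space out",
    "i vape a joint then space out",
]


def train(sentences, window=2):
    """Count word frequencies and, per word, the words seen within `window` tokens."""
    word_counts = Counter()
    context_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            word_counts[word] += 1
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context_counts[word][tokens[j]] += 1
    return word_counts, context_counts


def impute(masked, candidates, word_counts, context_counts, window=2, mask="___"):
    """Return a probability for each candidate filling the masked slot."""
    tokens = masked.lower().split()
    i = tokens.index(mask)
    context = [t for j, t in enumerate(tokens) if j != i and abs(j - i) <= window]
    vocab = len(word_counts)
    n_words = sum(word_counts.values())
    log_scores = {}
    for cand in candidates:
        total = sum(context_counts[cand].values())
        # Add-one smoothed prior times add-one smoothed context likelihoods.
        score = math.log((word_counts[cand] + 1) / (n_words + vocab))
        for c in context:
            score += math.log((context_counts[cand][c] + 1) / (total + vocab))
        log_scores[cand] = score
    # Normalise in log space so the probabilities sum to 1.
    peak = max(log_scores.values())
    exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
    z = sum(exp_scores.values())
    return {k: v / z for k, v in exp_scores.items()}


word_counts, context_counts = train(CORPUS)
probs = impute(
    "when I get home tonight, i vape a ___ then space out",
    ["joint", "cartridge", "cigar"],
    word_counts,
    context_counts,
)
```

On this tiny corpus "joint" ranks first and "cigar" last, mostly because "cigar" never co-occurs with "vape", "then" or "space". The same surprise scores could be reused for the anomaly-detection angle, by flagging words that are improbable in their context.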
Doing this rigorously, like building search engines for illegal drugs or human trafficking on the deep web, requires a lot of expert knowledge. [2] Maybe future deep learning can do this end-to-end on arbitrary domains? [3] Let's see.
[1] https://arxiv.org/abs/1312.3005 "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling"
intentional use of slang terms during serious discourse in order to subtly delegitimize your opponent's arguments. There is currently an Atlanta City Council member who, when responding to a question about, e.g., medical marijuana, will always change the noun to "pot" or "weed" or even "weed.. or.. pot", with an enunciation implying the concept of medical marijuana is a joke.
A while ago, spaCy set up a demo doing just that on the Reddit dataset:
https://demos.explosion.ai/sense2vec/?word=cannabis&sense=au...
https://demos.explosion.ai/sense2vec/?word=marijuana&sense=a...
Reddit comments are very idiosyncratic, and in this particular case, even more so than usual. As a result, I am skeptical of trusting the output of such APIs as gospel, even one trained on massive datasets. (However, training a model on a Reddit-only dataset might be interesting, and is an idea I have in the pipeline.)
Last year, spaCy trained a model, sense2vec, on the Reddit dataset and got interesting results: https://explosion.ai/blog/sense2vec-with-spacy
(I don't disagree with anything you wrote, just expanding.)
However, it worked on me; I'll probably give these tools a spin in the near future.
so here's some history. the name "marijuana" was pushed by Harry Anslinger[0] as a way to trigger racial anxiety amongst conservative whites who held negative views of Mexicans. the other names for the plant are "hemp" (a non-psychoactive strain used as an industrial fiber crop) and "cannabis" (the Latin name for the genus of the plant).
https://www.merryjane.com/news/want-marijuana-legalized-then...
People haven't been this obsessed with a politician's middle name since Barack hit the scene almost a decade ago. I'm really glad you felt the need to add it, since I don't think I would've felt the full dramatic effect of your comment otherwise.
Franklin Delano Roosevelt, Lyndon Baines Johnson, Warren Gamaliel Harding.
For example, a shill or superuser (the people getting top comments) will not be using domain-specific language -- they will be using language that caters to a general audience. If this is true, you would end up squeezing most of the interesting language out of your study. Have you been to the Grass City forums? I am guessing these people surely aren't using terms like "Donald Trump" in their everyday conversations about weed.
Reddit is a huge melting pot and probably isn't a good place for insight about potheads. Grass City might not be either -- Grass City users are not typical potheads. The best place would be 10th grade high school social circles and college dorms. It really is amazing how little data is produced by social networks, in the grand scheme of things. We are all so used to hearing about how much data is produced by the internet. There are orders of magnitude more data in the raw world just waiting to be scooped up.