The code isn't public because I was concerned about people taking it to make a more popular version of the same thing. Not that it is difficult to glue together the two APIs. It is also an embarrassing mess of around 150 lines of Python.
One issue with linking to the commits or repo is naming & shaming. The other is, as I mentioned, people trying to get on the bot intentionally.
It polls the GitHub API once a minute so as not to put noticeable strain on it. I think the firehose of constant commit messages has only gotten worse.
Here is what we did, as far as I recall. This worked a heck of a lot better than I expected it to.
1. Make a copy of the text and replace all the common non-letters that look somewhat like letters with those letters. E.g., 0 => O, 1 => L, 3 => E, 4 => A, 7 => T, @ => A.
2. Do a spell check on each word. Set a flag on each character that comes from a word that passed the spell check.
3. Scan the sequence of characters, ignoring all punctuation and white space, looking for sequences that match known profanity.
4. For each sequence that matches, mark it as profanity if any of its characters were not flagged in step #2.
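A rough Python sketch of the four steps above, in case it helps. This is only my reading of the description, not the original code: the function name, the set-based "spell check", and the choice to spell check the original words rather than the substituted copy are all assumptions. The banned word is borrowed from the Cialis example elsewhere in this thread.

```python
import re

# Step 1 substitutions, taken from the examples in the description.
LOOKALIKES = str.maketrans({"0": "O", "1": "L", "3": "E", "4": "A", "7": "T", "@": "A"})
PUNCT = ".,;:!?\"'()"


def find_profanity(text, dictionary, banned):
    """Return the spans of `text` that look like profanity.

    `dictionary` (lowercase words) stands in for a real spell checker and
    `banned` (uppercase words) for a real profanity list; both are hypothetical.
    """
    # Step 1: copy the text with look-alike characters replaced.
    # The mapping is one-to-one, so indexes still line up with `text`.
    normalised = text.translate(LOOKALIKES)

    # Step 2: flag every character that belongs to a correctly spelled word.
    flagged = [False] * len(text)
    for match in re.finditer(r"\S+", text):
        if match.group().strip(PUNCT).lower() in dictionary:
            for i in range(match.start(), match.end()):
                flagged[i] = True

    # Step 3: drop punctuation and white space, remembering where each
    # remaining character came from, then search the compacted string.
    kept = []
    for i, ch in enumerate(normalised):
        upper = ch.upper()
        if ch.isalnum() and len(upper) == 1:
            kept.append((upper, i))
    compact = "".join(ch for ch, _ in kept)

    hits = []
    for bad in banned:
        start = compact.find(bad)
        while start != -1:
            positions = [i for _, i in kept[start:start + len(bad)]]
            # Step 4: it only counts if some character was not flagged.
            if not all(flagged[i] for i in positions):
                hits.append(text[positions[0]:positions[-1] + 1])
            start = compact.find(bad, start + 1)
    return hits
```

With a dictionary containing "specialist", the text "The specialist is in" produces no hits, while "buy ci4lis now" and "buy c.i.a.l.i.s now" are both caught.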
As described above, this probably would not scale well to high traffic because of the spell check. That could be improved by changing the order: look for potential profanity first. Only if some is found would you then do the spell check, and you would only need to spell check the words that span the potential profanity.
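Here is a sketch of that reordered variant in Python, again as an illustration only. The name find_profanity_lazy and the is_word callback are made up for the example; is_word stands in for whatever spell checker you have. It is only called for words that overlap a candidate match.

```python
import re

# Look-alike substitutions from step 1 of the description.
LOOKALIKES = str.maketrans({"0": "O", "1": "L", "3": "E", "4": "A", "7": "T", "@": "A"})
PUNCT = ".,;:!?\"'()"


def find_profanity_lazy(text, is_word, banned):
    """Search first; spell check only the words a candidate match touches.

    `is_word` is a hypothetical callable taking a lowercase word and
    returning True if it is spelled correctly. `banned` holds uppercase words.
    """
    normalised = text.translate(LOOKALIKES)

    # Compact the text, remembering each kept character's original index.
    kept = []
    for i, ch in enumerate(normalised):
        upper = ch.upper()
        if ch.isalnum() and len(upper) == 1:
            kept.append((upper, i))
    compact = "".join(ch for ch, _ in kept)

    words = None  # word spans, built only if a candidate is found
    hits = []
    for bad in banned:
        start = compact.find(bad)
        while start != -1:
            positions = [i for _, i in kept[start:start + len(bad)]]
            if words is None:
                words = [(m.start(), m.end(), m.group())
                         for m in re.finditer(r"\S+", text)]
            # Only the words that contain a matched character get checked.
            touched = [w for s, e, w in words
                       if any(s <= p < e for p in positions)]
            if not all(is_word(w.strip(PUNCT).lower()) for w in touched):
                hits.append(text[positions[0]:positions[-1] + 1])
            start = compact.find(bad, start + 1)
    return hits
```

For clean text the spell checker is never invoked at all, which is where the saving comes from. A real version would probably also cache spell check results per word.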
Why would that be a bad thing?
"Problems can occur with the words socialism, socialist, and specialist because they contain the substring Cialis, the brand name for an erectile dysfunction medication commonly advertised in spam e-mails."
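To make the quoted problem concrete, here is a small Python comparison of a naive substring check against a whole-word check. Both function names are invented for this example, and the whole-word version is just one common mitigation, not something the quote prescribes.

```python
import re


def naive_match(text, banned):
    # Plain substring test: this is what flags "socialist" and "specialist".
    return banned.lower() in text.lower()


def whole_word_match(text, banned):
    # Only match when the banned term stands alone as a word.
    pattern = rf"\b{re.escape(banned)}\b"
    return re.search(pattern, text, re.IGNORECASE) is not None
```

The naive check flags "socialism", "socialist" and "specialist"; the whole-word check flags none of them but still catches "Buy Cialis today". The trade-off is that the whole-word check is trivially evaded by spacing or punctuation tricks such as "c.i.a.l.i.s".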
Measuring code quality in "WTFs per minute" (as detailed in this famous comic - http://i.imgur.com/J1svNp7.jpg)
The fewer WTFs/min, the better the code is.
I know people in all occupations swear, but this puts the focus on it and allows people to quantify it.
Just imagine the article:
"One report showed that the f-word was used over 5,000 times in a single month. No other profession that's been measured has shown anywhere near this level of profanity or such an unwelcoming environment. How are stay-at-home parents supposed to feel welcome in this community?"
I think we as society will soon need to learn how to "forgive and forget".
As technology becomes more ubiquitous, we are quickly getting to the point of having the majority of our time in this world recorded and stored permanently. Every mistake, every poor decision, every fashion fad, every comment, every choice seared into the never forgetting internet.
We will need to move away from the idea that people can't change, and that something a person did 10 years ago means anything significant today. We should celebrate changing opinions, not attack someone for "waffling". We should be giving significantly less weight to actions and opinions far in the past, and more to things that happen more recently.
There are commits on some projects I've worked on that I absolutely wouldn't word that way now. There are silly jokes and things like it in messages that I wouldn't do now. But my past is my past, and it honestly says very little about my future. I know people that are consistent with some things throughout their whole lives, and others that change drastically over the course of a year.
It won't be pretty, but we will need to learn as a species that people make mistakes, and that a mistake shouldn't be a lifelong, never-ending mark against someone. In some cases the opposite is true! I lost a significant amount of money years ago because a hard drive died and my backup hard drive died as well. Because of that experience, I now have a militant backup strategy that I am unwavering on.
I don't think we should shy away from tools like this; in fact, I think they will be more important in the future. Expose people's "skeletons" that are hidden in plain sight. Show the world that it's not a big deal. It will hurt at first, but hopefully over time society will begin to change course, and "character assassination" style headlines will be a thing of the past when "Person did something bad 10 years ago" isn't a big deal since they have 9 years of "newer" history showing they are not the same as they were.
People fear the unknown, people fear change. Showing them that the "foul programmer" is no different than they are is (in my opinion) a better way of dealing with the problem than hiding it and hoping the public doesn't find out.
Author obviously has never worked a blue collar job...
"Get the fucking login button working"
"Fuck painbody, they don't know how to wire up a login button"
I write a lot of commit messages with "offensive" language, but it's not directed at anybody and it's just to blow off steam. How are stay-at-home parents supposed to feel welcome in this community? Release the death grip on their pearls and realize that people swear and that there's no way a "bad word", on its own, can hurt you. Unless someone is directing harsh language at you, you have no reason to be offended by the word "fuck". Good god, it's just a word.
Extremely naive to think that just because it is a word, it does not matter.
A single word in the wrong place or at the wrong time could destroy your career, make you a social pariah or even get you killed. Words are powerful.
Hmm, I'm not sure I would want to be labeled, in public, as a "bored Microsoft programmer." His manager is wondering "Did he write this on Microsoft time?" "What else is he doing to alleviate his boredom?" "I give him plenty of work to do so why is he bored?"
One day, we get a call that an operator couldn't send an email since the third word was classed as "offensive", however the message body was fine. After firing up the debugger, it became obvious... the customer's surname was "Dick".
We never got to the bottom of how to solve it, so we hacked something in. I wonder how many times Mr. Dick has issues with false positives in profanity filters.
I find it interesting that this Twitter bot seems to have the same problem in reverse: it can't reliably filter out things that aren't offensive.
Edit: Thinking about it, it's really still the same problem, i.e. false positives when trying to automatically determine whether or not a given string contains swearwords.