Comment Spam Heuristics
Lately my blog has been hit with a torrential downpour of comment spam. I’ve been able to fight much of it off with some creative regular expressions in my ReverseDOS configuration file. Of course, keyword filtering, even Bayesian filtering, can only go so far. We need to supplement these approaches with something else.
But first, in order to combat spam, we need to identify the enemy. Are we fighting against automated bots relentlessly crawling the web and posting comments? Or are these low-paid humans behind the keywords? Are they attacking via the Comment API or posting to an HTML form?
My assumption has been that these are bots, but I plan to add some diagnostics to my blog to test that assumption someday soon. Let’s run with the assumption that the bulk of comment spam is generated by bots. In this case, we need to examine the behavioral differences between bots and humans for clues about how to combat spam.
For example, an automated script can pretty much post a spam comment instantaneously. What if your blog engine timed the interval between sending out the content and receiving a comment? If the comment came back too quickly, faster than a person could plausibly read the post and type a reply, then we have high confidence that it is spam.
Certainly this is easily defeated by a spammer by adding a delay, but an artificial delay is costly to an automated script trying to hit the most blogs possible in the shortest amount of time. Anything to slow down the spammers is worthwhile.
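To make the timing idea concrete, here is a minimal Python sketch. It is not how Subtext or ReverseDOS works; it just assumes the blog engine can embed a hidden form field when it renders the comment form and read that field back when the comment is posted. The names issue_token and looks_too_fast, the SECRET_KEY value, and the five-second threshold are all made up for illustration.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; load from config in practice
MIN_SECONDS = 5.0  # assumed threshold; tune it against real traffic


def _sign(issued):
    """HMAC-SHA256 signature of the timestamp string, as hex."""
    return hmac.new(SECRET_KEY, issued.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_token(now=None):
    """Return a signed timestamp to place in a hidden field of the comment form."""
    stamp = time.time() if now is None else now
    issued = f"{stamp:.3f}"
    return f"{issued}:{_sign(issued)}"


def looks_too_fast(token, now=None):
    """True if the token is malformed or forged, or the comment came back too quickly."""
    now = time.time() if now is None else now
    try:
        issued, sig = token.split(":", 1)
        if not hmac.compare_digest(sig.encode("utf-8"), _sign(issued).encode("utf-8")):
            return True  # timestamp was tampered with
        return now - float(issued) < MIN_SECONDS
    except ValueError:
        return True  # missing separator or unparsable timestamp
```

Signing the timestamp means the server does not have to remember when it served each page, and a bot cannot simply backdate the field. A bot can still wait out the threshold, which is exactly the cost we are trying to impose.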
Ultimately, these approaches are more about the behavior of the spammer than the content. For example, when I first started working on Subtext, I added two features that at the time blocked a significant amount of spam for me. The first was to not allow duplicate comments. I found that a lot of comment spam simply posted the same thing over and over.
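A duplicate check along these lines can be sketched in a few lines of Python. This is an illustration rather than Subtext's actual implementation: the DuplicateFilter class is hypothetical, it keeps fingerprints in memory where a real blog engine would use its database, and the decision to ignore case and whitespace differences is my assumption.

```python
import hashlib


class DuplicateFilter:
    """Remembers a fingerprint of every comment body seen so far (in memory only)."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def _fingerprint(body):
        # Lowercase and collapse whitespace so trivial variations still match.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def is_duplicate(self, body):
        """Return True if this body was seen before; otherwise record it and return False."""
        fingerprint = self._fingerprint(body)
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False
```

For example, after is_duplicate("Nice Site!") returns False, a second call with "  nice   site! " returns True, because both bodies normalize to the same text.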
The second feature was to require a delay between comments originating from the same IP address. Using a sliding timeout of only two minutes seemed to defuse spam bombs which would try to post a large number of comments in a short period of time.
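Here is a rough Python sketch of that per-IP delay. It reads "sliding" to mean that every attempt, even a rejected one, restarts the two-minute clock; that reading is my assumption, as are the IpThrottle class name and the in-memory dictionary.

```python
import time


class IpThrottle:
    """Allows one comment per IP address per quiet period (sliding timeout)."""

    def __init__(self, timeout_seconds=120.0):
        self.timeout = timeout_seconds
        self._last_attempt = {}  # IP address -> time of the most recent attempt

    def allow(self, ip, now=None):
        """Return True if this IP has been quiet for at least the timeout."""
        now = time.monotonic() if now is None else now
        last = self._last_attempt.get(ip)
        # Record every attempt, allowed or not, so hammering keeps the window open.
        self._last_attempt[ip] = now
        return last is None or now - last >= self.timeout
```

Because rejected attempts also reset the clock, a bot that keeps hammering never gets another comment through, while a person who waits a couple of minutes is unaffected. A long-running process would also want to evict old entries from the dictionary.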
Later, I added ReverseDOS to help catch the spam that made it through these approaches. Over time, I've noticed that comment spam starts to look more and more like legitimate messages, like the current crop of “Nice Site!” spam.
The one thing that every piece of comment spam has in common is a link. Ultimately, the only way to stop comment spam via a content-based approach is to simply not allow any comment that contains a link in any way, shape, or form. But how awful would that be for the many legitimate commenters who wish to share a link?
No, we must do something better. I currently don’t think we’ll ever win the battle, but we can work to stay one step ahead.