Comment Spam Heuristics


Lately my blog has been hit with a torrential downpour of comment spam. I’ve been able to fight much of it off with some creative regular expressions in my ReverseDOS configuration file. Of course, keyword filtering, even Bayesian filtering, can only go so far. We need to supplement these approaches with something else.

But first, in order to combat spam, we need to identify the enemy. Are we fighting against automated bots relentlessly crawling the web and posting comments? Or are there low-paid humans behind the keywords? Are they attacking via the Comment API or posting to an HTML form?

My assumption has been that these are bots, but I plan to add some diagnostics to my blog to test that assumption someday soon. Let's run with the assumption that the bulk of comment spam is generated by bots. In this case, we need to examine the behavioral differences between bots and humans for clues on how to combat spam.

For example, an automated script can pretty much post a spam comment instantaneously. What if your blog engine timed the interval between sending out the content and receiving a comment? If the comment came back too quickly, then we have high confidence that it is spam.

Certainly this is easily defeated by a spammer by adding a delay, but an artificial delay is costly to an automated script trying to hit the most blogs possible in the shortest amount of time.  Anything to slow down the spammers is worthwhile.
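Here is a minimal sketch of that timing check, written in Python purely for illustration. Nothing here comes from an actual blog engine: the `CommentTimer` class, the per-form token, and the five-second threshold are all assumptions I made up for the example, and a real engine would need to persist and expire the timestamps.

```python
import time

# Assumed threshold: a human is unlikely to read a post and write a
# comment in under five seconds. Tune this against real traffic.
MIN_SECONDS_TO_COMMENT = 5.0


class CommentTimer:
    """Hypothetical helper that times page-served to comment-received."""

    def __init__(self, min_seconds=MIN_SECONDS_TO_COMMENT, clock=time.monotonic):
        self._min_seconds = min_seconds
        self._clock = clock
        self._served_at = {}  # form token -> time the page was sent out

    def page_served(self, form_token):
        """Record when the page carrying this form token was sent."""
        self._served_at[form_token] = self._clock()

    def looks_automated(self, form_token):
        """Return True if the comment came back suspiciously fast.

        A token we never handed out is also treated as suspicious,
        since the poster apparently never requested the page.
        """
        served = self._served_at.pop(form_token, None)
        if served is None:
            return True
        return (self._clock() - served) < self._min_seconds
```

The clock is injectable only so the check can be exercised without actually waiting. A comment flagged this way probably deserves moderation rather than outright deletion, since a fast typist with a short reply could trip it.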

Another potential approach is to require JavaScript to comment. Perhaps your comment form doesn’t even exist until some JavaScript inserts it into the page. The theory behind this approach is that most automated scripts won't evaluate JavaScript. They simply want to post to some form fields. Unfortunately this hinders the accessibility of your site for users who turn off JavaScript, but it may be worth the price. Spammers will eventually figure this one out too, but implementing JavaScript handling in an automated spambot does add a nice computational cost.

Ultimately, these approaches are more about the behavior of the spammer than the content.  For example, when I first started working on Subtext, I added two features that at the time blocked a significant amount of spam for me.  The first was to not allow duplicate comments.  I found that a lot of comment spam simply posted the same thing over and over.
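A duplicate check along those lines might look like the following Python sketch. This is not the Subtext code; the `DuplicateCommentFilter` name, the whitespace and case normalization, and the in-memory set are assumptions for illustration, and a real blog engine would store the fingerprints in its database.

```python
import hashlib
import re


class DuplicateCommentFilter:
    """Hypothetical filter that rejects a comment body seen before."""

    def __init__(self):
        self._seen = set()  # fingerprints of comments already accepted

    @staticmethod
    def _fingerprint(body):
        # Assumed normalization: collapse runs of whitespace and ignore
        # case, so trivial variations still count as the same comment.
        normalized = re.sub(r"\s+", " ", body).strip().lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def is_duplicate(self, body):
        """Return True if this body was already posted; else remember it."""
        fingerprint = self._fingerprint(body)
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False
```

Hashing the normalized text keeps the stored data small and makes the lookup cheap, at the cost of missing spam that varies by even a single word.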

The second feature was to require a delay between comments originating from the same IP address. Using a sliding timeout of only two minutes seemed to defuse spam bombs that would try to post a large number of comments in a short period of time.
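To make the sliding timeout concrete, here is a rough Python sketch. It reflects my reading of "sliding" as "every attempt, even a rejected one, restarts the window", which is an assumption rather than a description of how Subtext actually does it. The `SlidingCommentThrottle` class and its in-memory dictionary are likewise hypothetical.

```python
import time

TIMEOUT_SECONDS = 120.0  # the two-minute window mentioned above


class SlidingCommentThrottle:
    """Hypothetical per-IP throttle with a sliding timeout."""

    def __init__(self, timeout_seconds=TIMEOUT_SECONDS, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_attempt = {}  # IP address -> time of most recent attempt

    def allow(self, ip_address):
        """Return True if this IP may post a comment right now."""
        now = self._clock()
        last = self._last_attempt.get(ip_address)
        # Sliding behavior: record this attempt whether or not it is
        # allowed, so a bot that keeps hammering never gets through.
        self._last_attempt[ip_address] = now
        return last is None or (now - last) >= self._timeout
```

With this design a spam bomb gets exactly one comment through, and every later attempt pushes the next allowed time further out. A human who posts, waits two minutes, and posts again is unaffected.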

Later, I added ReverseDOS to help catch the spam that made it through these approaches.  Over time, I've noticed that comment spam starts to look more and more like legitimate messages, like the current crop of “Nice Site!” spam. 

The one thing that every comment spam has in common is a link. Ultimately, the only way to stop comment spam via a content-based approach is to simply not allow any comment that contains a link in any way, shape, or form. But how awful would that be for the many legitimate commenters who wish to share a link?

No, we must do something better. I currently don’t think we’ll ever win the battle, but we can work to stay one step ahead.
