Comment Spam Heuristics

Lately my blog has been hit with a torrential downpour of comment spam.  I’ve been able to fight much of it off with some creative regular expressions in my ReverseDOS configuration file.  Of course, keyword filtering, even Bayesian filtering, can only go so far.  We need to supplement these approaches with something else.

But first, in order to combat spam, we need to identify the enemy.  Are we fighting against automated bots relentlessly crawling the web and posting comments?  Or are there low-paid humans behind the keywords?  Are they attacking via the Comment API or posting to an HTML form?

My assumption has been that these are bots, but I plan to add some diagnostics to my blog to test that assumption someday soon.  Let's run with the assumption that the bulk of comment spam is generated by bots.  In this case, we need to examine the behavioral differences between bots and humans for clues on how to combat spam.

For example, an automated script can pretty much post a spam comment instantaneously.  What if your blog engine timed the interval between sending out the content and receiving a comment? If the comment came back too quickly, then we have high confidence that it is spam.

Certainly this is easily defeated by a spammer by adding a delay, but an artificial delay is costly to an automated script trying to hit the most blogs possible in the shortest amount of time.  Anything to slow down the spammers is worthwhile.
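
As a rough illustration of the timing idea, here is a minimal Python sketch. It assumes the blog engine embeds a signed timestamp in a hidden form field when it serves the page and checks it when the comment comes back. The names `SECRET_KEY`, `MIN_SECONDS`, `issue_form_token`, and `came_back_too_quickly` are hypothetical, and the five-second threshold is a guess that would need tuning against real readers.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"change-me"  # hypothetical server-side secret
MIN_SECONDS = 5.0          # assumed threshold, not a measured value


def _sign(issued):
    """Sign the timestamp so a bot cannot simply invent an old one."""
    return hmac.new(SECRET_KEY, issued.encode(), hashlib.sha256).hexdigest()


def issue_form_token(now=None):
    """Return "timestamp:signature" to place in a hidden form field."""
    issued = str(int(time.time() if now is None else now))
    return f"{issued}:{_sign(issued)}"


def came_back_too_quickly(token, now=None):
    """True if the token is malformed, forged, or younger than MIN_SECONDS."""
    now = time.time() if now is None else now
    try:
        issued, sig = token.split(":", 1)
        if not hmac.compare_digest(sig.encode(), _sign(issued).encode()):
            return True
        return now - int(issued) < MIN_SECONDS
    except ValueError:
        return True
```

A comment flagged this way need not be rejected outright; it could just as well be routed to moderation.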

Another potential approach is to require JavaScript to comment.  Perhaps your comment form doesn’t even exist until some JavaScript inserts it into the page.  The theory behind this approach is that most automated scripts won't evaluate JavaScript. They simply want to post to some form fields.  Unfortunately this hinders the accessibility of your site for users who turn off JavaScript, but it may be worth the price.  Spammers will eventually figure this one out too, but it does add a nice computational cost, since the spambot now has to implement JavaScript handling.
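
One variation on this idea, sketched in Python below, keeps the form in the page but adds a hidden field that only a script can fill in correctly. This is an illustration under assumptions, not a description of any existing blog engine: the field name `js_answer` and the helper functions are hypothetical, and the server is assumed to remember the two numbers (for example in the session) between rendering the form and receiving the post.

```python
import secrets


def new_challenge():
    """Pick two two-digit numbers; the page's script must submit their sum."""
    return secrets.randbelow(90) + 10, secrets.randbelow(90) + 10


def render_form_snippet(a, b):
    """HTML fragment: the hidden field stays empty unless a script runs."""
    return (
        '<input type="hidden" name="js_answer" id="js_answer" value="">\n'
        f'<script>document.getElementById("js_answer").value = {a} + {b};'
        "</script>"
    )


def passed_js_check(a, b, submitted):
    """Compare the posted hidden field against the expected sum."""
    return (submitted or "").strip() == str(a + b)
```

A bot that posts straight to the form fields without running the script leaves `js_answer` empty and fails the check, as does a reader with JavaScript turned off.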

Ultimately, these approaches are more about the behavior of the spammer than the content.  For example, when I first started working on Subtext, I added two features that at the time blocked a significant amount of spam for me.  The first was to not allow duplicate comments.  I found that a lot of comment spam simply posted the same thing over and over.
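
A duplicate check can be as simple as remembering a fingerprint of every comment already accepted. The Python sketch below is illustrative only and is not Subtext's actual code; the `DuplicateFilter` class is hypothetical, it keeps fingerprints in memory rather than in a database, and the choice to ignore case and whitespace differences is an assumption.

```python
import hashlib
import re


class DuplicateFilter:
    """Remembers fingerprints of comment bodies seen so far (in memory)."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def _fingerprint(body):
        # Collapse whitespace and ignore case so trivial edits still match.
        normalized = re.sub(r"\s+", " ", body).strip().lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def is_duplicate(self, body):
        """Return True for a repeat; otherwise record the comment."""
        fp = self._fingerprint(body)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False
```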

The second feature was to require a delay between comments originating from the same IP address.  Using a sliding timeout of only two minutes seemed to defuse spam bombs, which would try to post a large number of comments in a short period of time.
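
A per-IP sliding timeout might be sketched in Python as follows. This is an assumption-laden illustration rather than the Subtext implementation: the `SlidingThrottle` class is hypothetical, and "sliding" is read here as meaning that every attempt, blocked or not, restarts the two-minute window.

```python
import time


class SlidingThrottle:
    """Allows one comment per IP until the window has passed quietly."""

    def __init__(self, window_seconds=120.0):
        self.window = window_seconds
        self._last_attempt = {}  # ip -> time of most recent attempt

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        last = self._last_attempt.get(ip)
        # Sliding: every attempt, allowed or not, restarts the window.
        self._last_attempt[ip] = now
        return last is None or now - last >= self.window
```

Under this reading, a bot that keeps hammering the form never gets through, because each blocked attempt pushes the window further out, while a reader who waits two minutes between comments is unaffected.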

Later, I added ReverseDOS to help catch the spam that made it through these approaches.  Over time, I've noticed that comment spam starts to look more and more like legitimate messages, like the current crop of “Nice Site!” spam. 

The one thing that every comment spam has in common is a link.  Ultimately, the only way to stop comment spam via a content-based approach is to simply not allow any comment that contains a link in any way, shape, or form.  But how awful would that be for the many legitimate commenters who wish to share a link?

No, we must do something better. I currently don’t think we’ll ever win the battle, but we can work to stay one step ahead.

What others have said

Greg Young Aug 28, 2006 6:27 PM
# re: Comment Spam Heuristics
By forcing people to use javascript you will only see the abusers turn to mshtml and other libraries as their form posting tool so they can use the DOM ...
chris Aug 28, 2006 9:21 PM
# re: Comment Spam Heuristics
I have been having comment spammers for some time so decided to implement a CAPTCHA control. This is basically an image with characters which the user must input before submitting the comment. I posted an entry in the forum on presstopia.com, which is the platform I use.

http://presstopia.com/dnn/Default.aspx?tabid=170&ptid=540&threadid=4959&forumtype=posts
mike Aug 29, 2006 1:54 AM
# re: Comment Spam Heuristics
I've also been getting a lot of comment spam lately on my subtext site. I agree with the ReverseDOS folks that CAPTCHA will only prevent legitimate posts before too long as the bots get more sophisticated.

What about something similar to the challenge/response systems that were developed to deal with email spam? In this scenario, all comments would be moderated. However, when you (as a SubText admin) allow a comment to be displayed, the email address (or some identifier) associated to that commenter is stored in a "valid commenters" list. The next time that commenter leaves a comment, it is posted without being moderated.

The main drawback I see to this system is that commenters will be required to leave their email address if they don't want to be moderated.

I guess another drawback is that there is a lot of moderation involved for the subtext admin. But at this point I'm starting to think that I'm already spending a lot of time cleaning up the spam so I might not be opposed to spending a fraction of the time moderating comments since I'm actually building a white list of commenters.

I'm sure that there are more holes in this system than what I've listed. Thoughts? Ideas?
Nicholas Paldino [.NET/C# MVP] Aug 29, 2006 2:00 AM
# re: Comment Spam Heuristics
Personally, I've never really liked the guards that were put up to block comment spam. It's an all-or-nothing approach that once someone figures a way around, it is useless.

So, what ends up happening is the guards become tougher, but it makes those that are trying to comment legitimately frustrated, and ends up providing a bad user experience.

To this end, I would say that heuristics are a better approach, because they can be tweaked. PHP-based sites have had a number of options available for some time. The problem with these is that when new heuristic algorithms were created, one would have to constantly update them.

That's where Akismet comes in. It is a (free) online service that you can send your comments to, and it will decide whether each one is spam or not. Then, you can decide whether to publish the comment or not.

It also has a feature that allows you to say that a comment is not spam if the service says it is, and to say that a comment is spam if the service says it is not.

The great part about this is that theoretically, the more comments that are submitted, the "smarter" it gets.

The link for the site is:

http://akismet.com

There is a developers link at:

http://akismet.com/development/

There is one library there implemented in .NET which you can use. I also have my own library, which of course, I support (and it's free as well).

If you are interested in it, email me, and I will be more than happy to provide it to you.

On a side note, I am eagerly waiting for the release of Subtext 1.9 for ASP.NET 2.0. Keep up the good work.
agoat Aug 29, 2006 5:19 AM
# re: Comment Spam Heuristics
[Quote]
What if your blog engine timed the interval between sending out the content and receiving a comment? If the comment came back too quickly, then we have high confidence that it is spam.

Certainly this is easily defeated by a spammer by adding a delay, but an artificial delay is costly to an automated script trying to hit the most blogs possible in the shortest amount of time. Anything to slow down the spammers is worthwhile.
[/Quote]


That won't slow them down at all. They may be slower to post individual comments, but their throughput will still be the same. Look at it this way: that block is an expensive IO operation. In your application you can either wait for that operation, completely blocking the entire app, or thread around it.
Steve Harman Aug 29, 2006 5:40 AM
# re: Comment Spam Heuristics
@Nicholas: The Akismet webservice looks interesting, and the .NET APIs seem fairly straightforward and easy to use.

When I get a little free time, I'll look into implementing this as a plug-in for subTEXT v2.0.

Thanks for the link!
Rydal Williams Aug 29, 2006 5:46 AM
# re: Comment Spam Heuristics
This is my nightmare. I've been using stored procs/jobs to delete comments that are not valid, and the web.config settings on subtext haven't been very helpful. If spamming is eliminated or decreased, I'll be a happy bird.
Willie Aug 29, 2006 7:08 AM
# re: Comment Spam Heuristics
Any way you're going to share the changes to the ReverseDOS file? I've made some changes to mine and I no longer get spam. This was a huge problem with .Text a while ago. Glad I upgraded :)
Willie Aug 29, 2006 7:10 AM
# re: Comment Spam Heuristics
What about making it so that IF a link was placed in the message it would need to be moderated before it showed up, and if not it would just plop up on the comments without moderation? I would think that the majority of comments don't have links in them.
haacked Aug 29, 2006 7:33 AM
# re: Comment Spam Heuristics
Great idea Willie. Another subtext developer came up with the same idea. I think we'll try to add that after we release 1.9.
Corey Aug 30, 2006 12:02 PM
# re: Comment Spam Heuristics
I just upgraded. Am loving ReverseDOS.

I am really interested in seeing what other people have done with theirs.

So I created a wiki to share config files.
http://reversedos.pbwiki.com/FrontPage

The edit password is "p o k e r" without the spaces because it just occurred to me it might mean that this comment will be blocked :)
you've been HAACKED Aug 30, 2006 6:42 PM
# What About CAPTCHA?
you've been HAACKED Sep 19, 2006 5:47 AM
# Atlas Comment Spam Heuristics
you've been HAACKED Sep 25, 2006 10:40 PM
# Lightweight Invisible CAPTCHA Validator Control
Creative Minds Sep 26, 2006 12:50 AM
# Lightweight Invisible CAPTCHA Validator Control
you've been HAACKED Oct 31, 2006 10:16 AM
# CAPTCHA For Trackbacks
