Atlas Comment Spam Heuristics

Remember my recent post in which I suggested that we need more heuristic approaches to the comment spam problem?

Check out this new NoBot control in the Atlas Control Toolkit. I wonder if this came out before or after I wrote my piece, because I don’t want y’all to think I cribbed my ideas from this control. It has a couple of the features that I mentioned.

  • Forcing the client’s browser to perform a configurable JavaScript calculation and verifying the result as part of the postback. (Ex: the calculation may be a simple numeric one, or may also involve the DOM for added assurance that a browser is involved)
  • Enforcing a configurable delay between when a form is requested and when it can be posted back. (Ex: a human is unlikely to complete a form in less than two seconds)
  • Enforcing a configurable limit to the number of acceptable requests per IP address per unit of time. (Ex: a human is unlikely to submit the same form more than five times in one minute)
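
To make those three checks concrete, here is a minimal server-side sketch in Python. It is not NoBot's actual code: the class name `FormGuard`, its parameters, and the result strings are all made up for illustration, and the "calculation" is just a multiplication that the page's script would be expected to perform and post back.

```python
import secrets
from collections import defaultdict, deque


class FormGuard:
    """Hypothetical sketch of the three heuristics; not NoBot's implementation."""

    def __init__(self, min_delay_seconds=2.0, max_requests=5, window_seconds=60.0):
        self.min_delay_seconds = min_delay_seconds  # earliest allowed postback
        self.max_requests = max_requests            # posts allowed per IP per window
        self.window_seconds = window_seconds
        self._issued = {}                           # token -> (expected answer, issue time)
        self._recent = defaultdict(deque)           # ip -> timestamps of recent posts

    def issue_challenge(self, a, b, now):
        """Called when the form is requested; the client script must compute a * b."""
        token = secrets.token_hex(16)
        self._issued[token] = (a * b, now)
        return token

    def check_post(self, token, answer, ip, now):
        """Called on postback; returns 'ok' or a string naming the failed check."""
        # Heuristic 3: limit the number of posts per IP address per unit of time.
        recent = self._recent[ip]
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_requests:
            return "too_many_requests"

        # Heuristic 1: the client must return the result of the calculation.
        issued = self._issued.pop(token, None)  # each token can be used only once
        if issued is None:
            return "unknown_challenge"
        expected, issued_at = issued
        if answer != expected:
            return "wrong_answer"

        # Heuristic 2: a human is unlikely to finish the form this quickly.
        if now - issued_at < self.min_delay_seconds:
            return "too_soon"
        return "ok"
```

Timestamps are passed in explicitly so the sketch is easy to test; a real page would use the server clock. Because each token is removed once it is checked, a form resubmitted from a cached page would come back as `unknown_challenge` in this sketch.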

I think that will be a nice minor addition to a comment spam fighter’s toolkit. It’s Invisible CAPTCHA.  Very cool!


What others have said

Nicholas Paldino [.NET/C# MVP] Sep 19, 2006 2:01 PM
# re: Atlas Comment Spam Heuristics
Are you going to be adding this to subtext (or is it already in there)? If so, can you have an option to disable it?
Haacked Sep 19, 2006 2:08 PM
# re: Atlas Comment Spam Heuristics
I'll probably add it and have an option to enable/disable it. The one problem with this approach is for those who don't use javascript. Not sure if that's a large constituency. I'll probably disable it by default. We'll see.
Alan Le Sep 19, 2006 2:39 PM
# re: Atlas Comment Spam Heuristics
I was looking at this control last night and found two potential problems.

1) The sample thinks I'm a bot when I used the back button and hit submit again. The javascript function is not called when the user clicks the back button.
2) If I'm a bot writer, I can adapt the response to put in a sleep that's more than a few seconds to mimic a human.
Haacked Sep 19, 2006 2:43 PM
# re: Atlas Comment Spam Heuristics
@Alan. Interesting about #1. I better test it thoroughly before I use it.

#2. True, but I doubt every current bot writer will do this. They might not even notice that they should do this unless a huge majority of blogs out there implement it.

If I was trying to post a comment on a million websites, it'd suck if each post had to wait a few seconds.
Steve Harman Sep 19, 2006 3:56 PM
# re: Atlas Comment Spam Heuristics
@Nicholas. I've been waiting to add ATLAS to subTEXT until they get a few more bugs killed... in particular this one.

Then my plan was to start adding AJAX functionality to the subTEXT Admin pages, and I think Phil had a few places in the main UI that he wanted to throw some updatePanels.

The official bug report that I opened (mentioned at the bottom of the article I linked to) still hasn't gotten any play from the ATLAS team... so this bug seems to be in limbo somewhere.
Mike Dimmick Sep 19, 2006 9:57 PM
# re: Atlas Comment Spam Heuristics
Please, not more filtering based on IP addresses. Over here in Europe it's very common for many employees to be behind a single NAT as we have fewer public IP addresses available than companies in the US. This is a historical aspect of how the address space was assigned. There are universities in the US with Class A allocations, potentially 16 million addresses, of which they will only use a tiny fraction, a few thousand at most.

In the Far East, whole ISP user communities can be behind a single public IPv4 address. The problem is that bad.

Any assumption that one IP address from the server's perspective == one user is Bad.
Jason Haley Sep 19, 2006 11:01 PM
# Interesting Finds: September 19, 2006
David Anson Sep 20, 2006 2:20 AM
# re: Atlas Comment Spam Heuristics
Love the feedback on NoBot, please keep it up!!

I'm the author of NoBot, so I thought I might be able to add some more information here.

First, NoBot was not influenced by Haacked's earlier post. The similarity is definitely there [great minds think alike? :) ], but NoBot already existed in pretty much the form it released in for some time before August 29th. I tried to do my research before starting NoBot, and didn't come across too much relevant information. The aforementioned blog post definitely would have been a good find! :)

Alan Le's observation that hitting Back in the browser causes problems is a good one and I've made a note to look into the matter. I suspect he's right that the issue is related to the browser being "smart" and avoiding another trip to the server. One could probably disable caching for the page via something like "no-cache" (I do this in our automated test suite, for example), but that's not the kind of thing I wanted a control to be doing to its page without very good reason. :) Page authors are, of course, welcome to take this approach if it helps.

The comment about the sleep being easily thwarted is both true and not true. :) Yes, bot writers can manually do a sleep, but as Haacked notes doing so will tie up their resources a bit. Probably not enough to stop them, but that's just one of the potentially many schemes NoBot uses to try to avoid bots/spam. Additional suggestions/recommendations are both welcome and encouraged! (Just drop a note to me via my blog and we can discuss your idea.)

Mike Dimmick's observation about IP address filtering is another good one. The good news is that NoBot's parameters can be set such that IP filtering is disabled (try setting CutoffMaximumInstances really high or just modifying the code to remove filtering entirely (the source code's free, remember!)). Again, IP filtering is only one of NoBot's schemes. I'd love to have more of them so that folks could pick and choose which ones were relevant to their particular sites/users.

Thanks again for the feedback - I hope you find NoBot helpful!!
Haacked Sep 20, 2006 2:24 AM
# re: Atlas Comment Spam Heuristics
Thanks David, I would have been surprised if my post had anything to do with NoBot. That would be too fast a turnaround! :)

Forcing a computation on the client is similar to approaches considered for dealing with Email Spam. It's just easier to implement for Comment SPAM as you've demonstrated. Nice work!
Ken Sep 20, 2006 12:17 PM
# re: Atlas Comment Spam Heuristics
Isn't it supposed to work though? I tried it through FeedDemon, IE itself, and Firefox. The demonstration said I was a bot on all three with an InvalidBadResponse message, suggesting the script didn't run, though I don't have JavaScript disabled or anything. As it turns out, I was seeding a torrent when trying it, so even that limited upload rate seems to have caused a problem with it getting the response back in time. Great idea, but I think it still needs a little work. Letting spam get through sucks, but annoying users with false positives is worse.
David Anson Sep 20, 2006 2:07 PM
# re: Atlas Comment Spam Heuristics
Ken, I'm not sure what the problem was for you. We test in IE6, IE7, Firefox, and Safari, and NoBot works fine for us in those environments. The only time restriction it imposes is on posting too *soon*, so I don't think that was the source of the problem in your case. I agree with your theory that the script didn't seem to run. Maybe try again when you're not seeding a torrent and/or check to see if you have JavaScript blockers in place somewhere? Thanks!
David Anson Sep 20, 2006 2:42 PM
# re: Atlas Comment Spam Heuristics
FYI, the work item to investigate the Back button issue above is now work item 3493 in our CodePlex project:
http://www.codeplex.com/WorkItem/View.aspx?ProjectName=AtlasControlToolkit&WorkItemId=3493
Ken Sep 20, 2006 5:30 PM
# re: Atlas Comment Spam Heuristics
After I tried it and it failed, it occurred to me I had the torrent running, so I closed it and then it started working. I saw the too-soon behavior, but that gave me a different message. With the torrent open, it was the InvalidBadResponse. My best guess (I'm assuming here) is that you do some sort of AJAX request/challenge before the post submits, and it might be possible that it didn't complete before the page submitted because the additional traffic on my upstream caused a slowdown.
Community Blogs Sep 26, 2006 7:05 AM
# Lightweight Invisible CAPTCHA Validator Control
Not too long ago I wrote about using heuristics to fight comment spam. A little later I pointed to the
