On my quest to stop the spambot

Over the past few weeks, the Playground has been hit hard by some strange users with strange user agents. Not only the Playground, but in fact all of my sites hosted here are being crawled by a user agent identifying itself as "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt; DTS Agent" (note the missing closing parenthesis). A quick search on Google reveals that this user agent belongs to "Beijing Express Email Address Extractor", i.e. a spambot that collects email addresses from web pages. These addresses are then sold to spammers, and that tells me one thing - my email address and my users' email addresses have been harvested for spamming!

I am thinking of masking the email addresses on the websites. However, that would also inconvenience visitors who wish to send me legitimate emails. It might therefore make more sense to explicitly block the spambots when they try to crawl the sites. Which blocking mechanism should I use? I am thinking of using a long list of Apache mod_rewrite rules to match the user agent string and block accordingly. Are there any other ways? And what about the spambots that masquerade as a proper web browser? Hmm...
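To make the user-agent idea concrete, here is a minimal sketch in Python rather than actual mod_rewrite rules. It only shows the matching logic: a list of patterns is checked against the User-Agent string, and a match means the request should be refused. The function name and the second pattern are made up for illustration; only the "DTS Agent" string comes from my logs.

```python
import re

# Patterns for user agents to refuse. Only "DTS Agent" comes from the logs
# described in this entry; "ExampleHarvester" is a hypothetical placeholder.
BLOCKED_AGENT_PATTERNS = [
    re.compile(r"DTS Agent", re.IGNORECASE),
    re.compile(r"ExampleHarvester", re.IGNORECASE),
]


def is_blocked_user_agent(user_agent: str) -> bool:
    """Return True if the User-Agent string matches any blocked pattern."""
    return any(p.search(user_agent) for p in BLOCKED_AGENT_PATTERNS)


if __name__ == "__main__":
    spambot = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt; DTS Agent"
    browser = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    print(is_blocked_user_agent(spambot))
    print(is_blocked_user_agent(browser))
```

This kind of check does nothing against a bot that sends an ordinary browser's user agent string, which is exactly the case I am still worried about.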

I am now on my quest to stop those nasty spambots from crawling my sites. I will keep updating this blog entry as I progress...

Updated 23 December 2002: I found a comprehensive document on the Internet - "Stopping Spambots: A Spambot Trap", written by Neil Gunton. It has quite a few examples and techniques for preventing spambots from harvesting emails from your website. I think I will start with a spambot trap, because I am also interested in distinguishing the masquerading bots from real users, and logging their activity might help me do that. Let's see...
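As a rough picture of what I have in mind for the trap, here is a small Python sketch. It is not taken from Neil Gunton's document; it is just my assumption of the general idea: a URL that no real visitor would ever follow, where any client requesting it gets logged and is refused from then on. The class name, the trap path and the IP address below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

TRAP_PATH = "/trap/"  # hypothetical path that no legitimate visitor should request


@dataclass
class SpambotTrap:
    trap_path: str = TRAP_PATH
    # Maps a client IP to the user agents it sent when it hit the trap.
    flagged: Dict[str, List[str]] = field(default_factory=dict)

    def handle(self, ip: str, path: str, user_agent: str) -> bool:
        """Return True if the request may be served, False if it is refused."""
        if path.startswith(self.trap_path):
            # Log the offender, whatever browser it claims to be.
            self.flagged.setdefault(ip, []).append(user_agent)
            return False
        # Clients that hit the trap earlier stay blocked.
        return ip not in self.flagged


if __name__ == "__main__":
    trap = SpambotTrap()
    ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    print(trap.handle("203.0.113.5", "/index.html", ua))     # normal page
    print(trap.handle("203.0.113.5", "/trap/emails.html", ua))  # trap page
    print(trap.handle("203.0.113.5", "/index.html", ua))     # same client again
    print(trap.flagged)
```

The point is that the trap keys on behaviour rather than on the user agent string, so a bot masquerading as a normal browser would still be caught once it follows the trap link, and the log would show which "browsers" are really bots.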