Blocking Mirroring Bots

A blog/journal site that I often read, listed on Rice Bowl Journals, had been taken offline when I visited it today because someone else has been mirroring it. From a link provided, it seems soksok.jp is a DoCoMo service provider that lets mobile phone users in Japan read blog sites (actually, any website) on the Internet with the complex layout stripped. It acts as a proxy service, returning and caching whatever URL you request. This is nothing new: the technology has been around for years, providing stripped/simplified content for smaller devices with embedded browsers.

However, what concerns bloggers is the privacy and copyright issue: I don't want other people to read my journal from another website. Even if you explicitly block search engine bots, people can still find the contents of your website via the proxied copy on soksok.jp. Basically, their bots do not honor robots.txt and will mirror your entire site without your consent. I too would be annoyed if private entries in my journal became searchable in Google.

The solution for now, I guess, is to block their bots. Here's something to add to your .htaccess to block all bots that identify themselves as "DoCoMo...":

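  # Flag any request whose User-Agent header starts with "DoCoMo"
  SetEnvIf User-Agent ^DoCoMo KEEPOUT

  # Allow everyone, then deny flagged requests (with Order Allow,Deny a matching Deny wins)
  <Limit GET POST>
    Order Allow,Deny
    Allow From All
    Deny From env=KEEPOUT
  </Limit>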
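
The directives above use the older Apache 2.2 access-control syntax (Order/Allow/Deny), which Apache 2.4 only supports through mod_access_compat. If your host runs Apache 2.4, roughly the same rule could be sketched as follows. This assumes mod_setenvif and mod_authz_core are enabled and that your host lets .htaccess set authorization directives; treat it as an untested starting point, not a drop-in fix.

```apache
# Flag any request whose User-Agent header starts with "DoCoMo"
SetEnvIf User-Agent ^DoCoMo KEEPOUT

# Apache 2.4 style: grant access to everyone except flagged requests
<RequireAll>
  Require all granted
  Require not env KEEPOUT
</RequireAll>
```

Unlike the Limit block, this version applies to every request method, not only GET and POST.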

That will only block "well-behaved" bad bots that send their identification. There are a lot of other bots, like email harvesters, that you might want to block as well. For more information, there's an article/discussion on WebmasterWorld.com:

Proper Syntax for Banning Bad Bots in htaccess
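
As a rough sketch of how the same pattern extends to other bots, you can set the same KEEPOUT variable from several User-Agent matches and keep a single Limit block. The harvester names below are only illustrative examples (an assumption on my part, not a vetted list); use the strings you actually see in your own access logs. SetEnvIfNoCase is the case-insensitive variant of SetEnvIf.

```apache
# Illustrative harvester names only; replace with User-Agent strings from your own logs
SetEnvIfNoCase User-Agent "EmailSiphon" KEEPOUT
SetEnvIfNoCase User-Agent "EmailWolf" KEEPOUT
# The original DoCoMo rule
SetEnvIf User-Agent ^DoCoMo KEEPOUT

<Limit GET POST>
  Order Allow,Deny
  Allow From All
  Deny From env=KEEPOUT
</Limit>
```

Any request that matches at least one of the patterns gets KEEPOUT set and is refused; everything else is served as usual.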