Combating Referrer Spam with WordPress

I hate referrer spam.

Not that I click on every referrer URL to see where it leads. I love my statistics, and anything that skews the results annoys me. I do not want to open my favourite log analyzer and see that all the top referrers are pr0n or p0ker sites.

At the same time, it is one of the most difficult kinds of spam to combat without causing too much inconvenience. You can put all incoming comments and trackbacks through a moderation queue without really irritating your readers. Try moderating every page view that comes from an unknown referrer!

There have been quite a few attempts to resolve this issue in the past.

Mod_Rewrite

My initial attempt was to put some mod_rewrite rules into the .htaccess file to return 403 Forbidden to referrer spammers/spambots. For example,

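RewriteEngine on
RewriteBase /
# Dots are escaped so they match literally; NC makes the match case-insensitive.
RewriteCond %{HTTP_REFERER} pr0n\.example\.com [NC,OR]
RewriteCond %{HTTP_REFERER} p0ker\.example\.com [NC,OR]
RewriteCond %{HTTP_REFERER} r0ulette\.example\.com [NC,OR]
RewriteCond %{HTTP_REFERER} l0an\.example\.com [NC]
# Any request matching one of the conditions above gets 403 Forbidden.
RewriteRule .* - [F,L]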

The problem is that my .htaccess file soon became a gigantic, unmanageable mess as I added spammy host names one after another. Moreover, referrer spammers seem to have such an unlimited supply of domain names that there is just no way to catch them all (the free .info domains a while ago did not help). And you normally only find new ones to add when the spambots have already visited your site and tainted your logs.

Combating referrer spam with mod_rewrite rules is a battle that can never be won.

Referrer Bouncer

Then I found Referrer Bouncer (via Blogging Pro News), a WordPress plugin that blocks referrer spam by matching against a plain text file. It gives an interesting response -- instead of denying the spambot access, it actually sends back a 302 "Found" to tell the spammer to go back to its own website.

It is designed this way to punish spammers by making them consume their own resources. However, the effect cannot be verified, as we do not know whether referrer spambots actually follow the redirect.

Does it work? Only slightly better than the old mod_rewrite approach. At least you will not render your blog inaccessible when you stuff up your regular expressions in RewriteCond. However, the old issue persists: it still does not address the ever-changing host names of referrer spammers.
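
To make the idea concrete, here is a minimal Python sketch of the "bounce" behaviour described above. It is not Referrer Bouncer's actual code: the blocklist format (one host name per line, "#" for comments), the parent-domain matching, and the function names load_blocklist and bounce_response are all my own assumptions for illustration.

```python
from urllib.parse import urlparse


def load_blocklist(text):
    """Parse a plain text blocklist (assumed format: one host per line, '#' comments)."""
    hosts = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip().lower()
        if line:
            hosts.add(line)
    return hosts


def bounce_response(referrer, blocked_hosts):
    """Return (status, headers) for a request carrying this Referer value."""
    host = (urlparse(referrer).hostname or "").lower()
    # Match the host itself or any parent domain listed in the blocklist.
    parts = host.split(".")
    candidates = {".".join(parts[i:]) for i in range(len(parts))}
    if host and candidates & blocked_hosts:
        # Send the client back to the URL it claimed to come from.
        return 302, {"Location": referrer}
    return 200, {}
```

A blocked referrer such as http://pr0n.example.com/page would get a 302 whose Location header is that same URL; anything else passes through with a 200.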

Referrer Karma

Last week I installed Referrer Karma (2.3b). The design philosophy is:

  1. If the referrer exists in the white-list -- in.
  2. If the referrer exists in the black-list -- out.
  3. Fetch the content of the referrer, and if my site name can be found in the content -- add to white-list and in.
  4. Otherwise, add to black-list and out.

It is actually a bit more complicated than that, but you get the idea.
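
The four steps above can be sketched roughly as follows, in Python rather than the plugin's own code. The function name check_referrer, the use of in-memory sets for the lists, and the injected fetch callable are assumptions made for illustration; the real plugin is more involved.

```python
from urllib.parse import urlparse


def check_referrer(referrer, site_name, whitelist, blacklist, fetch):
    """Return True to let the request in, False to turn it away.

    whitelist and blacklist are sets of host names, updated in place.
    fetch is any callable taking a URL and returning the page text;
    it may raise OSError if the page cannot be retrieved.
    """
    host = (urlparse(referrer).hostname or "").lower()
    if not host:
        return True  # assumption: requests without a usable referrer are not judged
    if host in whitelist:  # step 1
        return True
    if host in blacklist:  # step 2
        return False
    try:
        content = fetch(referrer)  # step 3: look for our site on the referring page
    except OSError:
        content = ""
    if site_name.lower() in content.lower():
        whitelist.add(host)
        return True
    blacklist.add(host)  # step 4
    return False
```

Passing fetch in as a parameter keeps the decision logic separate from the network code, so the flow can be tried out against a dictionary of fake pages before it ever touches a real site.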

What is good about this approach is that it puts all page views that arrive with a referrer through a moderation queue, and automates the moderation by checking whether the referrer is genuine. So there is no more weekly hunt through my access_log files for fishy referrers, as the illegitimate ones will already have been filtered out.

I am now running Referrer Karma on some of my sites, and it works reasonably well. I can see logs of baddies getting rejected at the door, and the good guys getting welcomed in -- most of the time, and at a price...

You see, the auto-moderation works by fetching the source content and trying to find traces of links. However, a legitimate referrer might not have the links inside its HTML content. Links might be loaded and generated via Javascript. The link might be clicked from inside an <iframe/>. So in order to reduce false negatives, Referrer Karma tries to retrieve up to 8 levels of recursion on Javascript and IFrames -- breadth-first style traversal, all at once, regardless of whether the main HTML content has already matched the links or not.

That is a lot of bandwidth usage, especially on complicated sites that might import 1-20 Javascripts or IFrames. Over the last couple of days my out-bound link was actually fully saturated for a few hours a day, and initially I could not figure out what caused it, apart from fearing I had been hacked or DDoS'ed. Well, you can blame my relatively narrow pipe (only 512kbps out-bound), but after debugging and analysing Referrer Karma, it turned out to be the cause.

Bad bad coding.

A quick hack to stop the traversal as soon as a match is found, and to reduce the depth to 2, solved my bandwidth problem. Referrer Karma is useful again.
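
For illustration, the hacked behaviour looks roughly like the Python sketch below: a breadth-first walk over embedded scripts and iframes that returns on the first match and never goes deeper than a fixed depth. This is not the plugin's code. The name links_back, the regular expression used to find src attributes, and the injected fetch callable are all assumptions.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Rough pattern for src="..." on <script> and <iframe> tags; real HTML needs a parser.
EMBED_SRC = re.compile(r'<(?:script|iframe)\b[^>]*\bsrc=["\']([^"\']+)["\']', re.I)


def links_back(start_url, site_name, fetch, max_depth=2):
    """Breadth-first search of a page and its embedded scripts/iframes.

    Returns True as soon as site_name is found, without fetching anything else.
    max_depth=0 checks only the start page.
    """
    queue = deque([(start_url, 0)])
    seen = {start_url}
    while queue:
        url, depth = queue.popleft()
        try:
            content = fetch(url)
        except OSError:
            continue  # unreachable resource: skip it
        if site_name.lower() in content.lower():
            return True  # early exit: no further fetches
        if depth < max_depth:
            for src in EMBED_SRC.findall(content):
                child = urljoin(url, src)  # resolve relative URLs
                if child not in seen:
                    seen.add(child)
                    queue.append((child, depth + 1))
    return False
```

The saving comes from the early return: a page that mentions the site in its main HTML costs exactly one fetch, however many scripts and iframes it embeds.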

Conclusion

There are still issues that make Referrer Karma generate a lot of false negatives. For example,

  • Source pages cannot be reached (inside an intranet, behind HTTP auth, etc.)
  • Legitimate pages do not contain links (links not available to anonymous users, time-sensitive links)
  • Obfuscated Javascript pages
  • etc...

There is no perfect solution, but Referrer Karma (+ a few hacks) does make sites more spam-free.