Proposal on referrer spam: Background and blacklists

Referrer (or referer) spam has become a serious problem in the blogosphere, and we need an intelligent way to eliminate this growing nuisance. I’ve thought about and researched it for the past few days, and below I offer a proposal for a technological solution. It requires programming, and I am not a programmer, so I welcome suggestions, corrections, and improvements.

I hope that this blog entry can serve as something of a starting point for information about referrer spam as well as a sandbox for exchanging ideas about methods of curbing or eliminating it.

Background

I will not go into much detail about the definition, origins, or motivations of referral spam. Please refer to other sites for that. I will mention that spammers are not stupid, and their activities always have a purpose. Spammers’ activities consume their own resources, and as long as bloggers continue to publish records of referrers, it will be profitable and worthwhile for referral spammers to continue in their endeavors.

Doesn’t rel="nofollow" solve the problem?

As you may have heard, an illustrious coalition of blogging and search engine companies recently announced support for a new HTML attribute designed primarily to combat comment spam; potentially, it’s even more effective against referral spam. The attribute is called rel="nofollow", and many bloggers are already praising it as the silver bullet the Web’s been waiting for.

The idea is actually quite simple; the hard part was getting the major players (Google, Yahoo, MSN, etc.) to agree on it. Basically, if a link is tagged with the rel="nofollow" attribute, it won’t contribute to that site’s PageRank. (“PageRank” is a Google-specific term, but I’m using it in the generic sense here.) Blogging tools such as Movable Type have implemented this standard by inserting the nofollow attribute in links in comments and TrackBacks. This link would not boost my PageRank even a smidgen:

<a href="http://underscorebleach.net" rel="nofollow">

That means comment spammers and referral spammers won’t get rewarded for their nefarious activities on websites that implement nofollow. So, is the problem solved? Maybe. Partially. But ultimately? Not in my view. Here’s why:

  1. nofollow will never reach 100% adoption, so there will always be some incentive (even if it’s decreased) to spam.
  2. Spammers have shown that they do not care whether their techniques are effective in any specific case so long as they are effective in general. I have never published my referrer logs, and referrer spammers have no real reason to hit my site, yet they do. They are targeting the blogosphere, not my site. Thus, as long as the blogosphere remains even partially vulnerable to referral spamming, it will continue. (Mark Pilgrim agrees with me here.)
  3. The resources required to fight spam, especially referral spam, so far outstrip the resources required to create it that nofollow is not a strong enough disincentive.

To expand upon point #3, consider just how easy it is to create referral spam. It’s far easier than comment spam: MT, WP, and other publishing systems now offer myriad tools to combat that nuisance, so comment spam is not nearly as simple an enterprise as it used to be. It’s also simpler to create than spam e-mail (and most e-mail users are protected by at least some sort of spam filter; logfiles are not). Referral spam is one HTTP request. The client need not even acknowledge the response. It need not send anything but a simple packet of formatted text.
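To make that concrete, here is a minimal sketch in Python of everything a spammer’s client has to do. The hostnames are placeholders, not real targets:

import http.client

# One GET request with a forged Referer header is the entire "payload."
conn = http.client.HTTPConnection("targetblog.example.org")
conn.request("GET", "/", headers={
    "Referer": "http://spam-site.example.com/",  # the spoofed referrer
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)",
})
# The response doesn't matter; the forged referrer is already sitting
# in the target's access log.
conn.close()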

Here, why don’t you spam Google for fun: go to wannaBrowser, enter Google.com as the Location, and enter anything (how about “This is referral spam!”) as the Referrer. Voila! You sent referral spam to Google. Amazing.

Moral of the story: Since rel="nofollow" is not a panacea, bloggers are still going to get referrer spam.

Recommended webmaster practices

Referer spam is a problem because spammers can improve their sites’ Google PageRank by getting listed on popular sites through spoofing of the HTTP_REFERER field in an HTTP request. (Jay Allen has suggested in the past that referral spammers want clickthroughs, but in an e-mail exchange he agreed that they probably do it for the PageRank. Still, clickthroughs could be part of the equation, given that so many of these spamming sites are shut down quickly by their hosts.)

Best practice #1: Don’t publish your referrers

If bloggers (and other website maintainers) did not publish this information, spammers would not bother to send these spoofed requests to blogs—it would be pointless. (For a humorous example, check out a blog entry on this very subject that’s actually being targeted by pr0n site referral spammers.) Therefore, I propose that bloggers discontinue this practice. Others agree. I, for one, have never clicked on a link published in a blog’s “Sites referring to me” (or similar) section. I think many bloggers simply believe this is a neat feature and have not evaluated its detrimental effect on the blogosphere as a whole.

Best practice #2: If you must publish referrers, include the page in robots.txt

If you’re married to the idea of publishing referrers, you might want to try dasBlog 1.7, which looks to have built-in support for a referral spam blacklist. Also, take note of this great idea from Dave Winer (of Radio UserLand fame):

Winer says, “A couple of weeks ago we finally figured out why porn sites add themselves to referer pages on high page-rank sites: to improve their placement in search engines. Last night at dinner Andrew Grumet came up with the solution. In robots.txt specifically tell Googlebot and its relatives to not index the Referers page. Then the spammers won’t get the page-rank they seek.”

Grumet’s idea is echoed in a recommendation for b2evolution users. Of course, this works only if you publish your referrers separately from the rest of your site’s content. If it’s embedded, robots.txt can’t help.
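For example, if your referrers page lived at /referers.html (adjust the path to your own setup), the robots.txt entry might look like this:

User-agent: *
Disallow: /referers.html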

Best practice #3: Rob spammers of PageRank with rel="nofollow"

With the introduction several weeks ago of rel="nofollow", you can also rob the spammers of PageRank at the link level, not just at the page level via robots.txt. Every link in the referrer section of your website that points to an external site should carry the rel="nofollow" attribute, without question.
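For example, each entry in a published referrer list might be rendered like this (the URL is a placeholder):

<li><a href="http://referring-site.example.com/" rel="nofollow">referring-site.example.com</a></li>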

Best practice #4: Gather a cleaner list of referrers using JavaScript and beacon images

As detailed by Marcel Bartels, referrer statistics gathered from beacon images loaded via JavaScript document.write statements are far more trustworthy than what the raw web server logs will contain. You may choose to disregard the referrers section of your server logs altogether and rely wholly on beacon images for referrer stats.
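A minimal sketch of the technique, assuming a hypothetical counter.cgi on your own server that logs its query string:

<script type="text/javascript">
// Real browsers execute this and request the beacon image, passing the
// referrer along; crude log-spamming clients never run the JavaScript.
document.write('<img src="/cgi-bin/counter.cgi?ref='
  + escape(document.referrer) + '" width="1" height="1" alt="" />');
</script>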

The .htaccess arms race is unwinnable

Referrer spammers are becoming more clever. They’re registering odd- or innocuous-sounding domains that redirect to the “mothership”—sites with names like houseofsevengables.com (not teen-fetish-sex.com) that are difficult for a human to distinguish from a legitimate website. It’s especially difficult because bloggers like to pick odd-sounding domains for their websites anyway. (For some fascinating speculation about referrer spam, see Nikke Lindqvist’s post, “Referral spammers – why are they doing it and what should we do about them?”)

In response to the ever-growing problem, many bloggers, including me, have begun fighting an unwinnable war with the referer spammers at the .htaccess level with mod_rewrite. (Some have even taken steps to automate this, such as with Referrer Spam Fucker 3000 or homebrew scripts).

But it’s not working. Take a look at the following:
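An illustrative sample of the kind of list I mean (the spam domains are ones named elsewhere in this entry; the example.org entries stand in for ordinary, legitimate referrers):

http://weblog.example.org/archives/2005/01/12.html
http://www.paramountseedfarms.org/
http://houseofsevengables.com/
http://diary.example.org/referrers/
http://www.canadianlabels.net/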

There are legitimate referrers in the file, but many illegitimate sites as well. Can you tell at a glance which are which? Not without visiting them… and that’s a problem. The .htaccess grows, and the spam still comes in. And more RewriteCond lines mean a greater chance of false positives.

Moral of the story: The arms race of .htaccess blocking is unwinnable.

Technical characteristics of referral spam

I started to look at the individual HTTP requests made by referral spammers. Here’s an example:

216.204.237.7 - - [13/Jan/2005:01:58:00 -0800] "GET /mt/mt-spameater.cgi?entry_id=764 HTTP/1.0" 200 5472 "http://www.paramountseedfarms.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)"

The spammer has taken pains to make the request look legitimate. The user-agent string looks very much like MSIE. (Interestingly, the request is in HTTP/1.0. Perhaps one could write a rule to exclude logfile entries in HTTP/1.0 that carry referrers; normally, spiders that use HTTP/1.0 do not pass a referrer. Someone more knowledgeable than I would need to verify this.)
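If that heuristic holds, it is easy to express at logfile analysis time. A sketch in Python against the combined log format (treat it as untested against every server’s quirks):

import re

# Combined log format: IP, identd, user, [date], "request", status,
# bytes, "referrer", "user-agent".
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"\S+ \S+ (?P<proto>[^"]+)" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "[^"]*"'
)

def suspicious(line):
    """Flag HTTP/1.0 requests that nonetheless carry a referrer."""
    m = LOG_RE.match(line)
    return bool(m) and m.group("proto") == "HTTP/1.0" \
        and m.group("referer") not in ("", "-")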

Also, it’s not as if all of the referral spamming is coming from the same IP or set of IPs. Someone is commanding a large set of zombies here. (About a year ago, juju.org tackled the referral spamming problem with some nice directions and found that the spam was coming from a single IP, but I believe things have gotten more complex since then.) I trawled through the logfile for 13 January 2005 for eight random referrer spams and found eight different IPs.

But what is special about the referrer spammer’s request is that his IP is probably blacklisted somewhere. Now, as long as we follow through on the recommendation above to stop publishing referrers, there’s no need to try to block the request in real time. Besides, that would be a waste of resources and would hurt the 99% of users who are legitimate. However, we can query blacklists, such as the Distributed Sender Blackhole List (DSBL), at logfile analysis time to filter out the referral spam.

Also see Nikke Lindqvist’s technical analysis of referral spam.

Idea #1: Filter referrer URLs against Jay Allen’s MT-Blacklist

Previously on this site, I have criticized MT-Blacklist. That doesn’t mean Jay Allen hasn’t done great work; the current blacklist is a masterpiece when used properly.

In this situation, I believe the blacklist could be a powerful, efficient weapon against referrer spam. See the current master blacklist file and compare it against my .htaccess rules, for example. Sites like houseofsevengables.com and canadianlabels.net are listed in the master blacklist file.

Therefore, if a logfile analysis program were to filter referrers against this list, it would certainly help root out spam. Also, the master blacklist is a simple text file that can be downloaded from a website (and easily mirrored). It seems to me that this idea could be implemented without much trouble. In fact, Omar Shahine has already written a .NET class to filter URLs against the MT-Blacklist.
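To sketch what that filtering might look like in Python: I’m assuming the master blacklist is fetched as a plain text file with one regex pattern per line and #-comments, which matches the copies I’ve seen, and the URL below should be adjusted to wherever the list is actually published:

import re
import urllib.request

BLACKLIST_URL = "http://www.jayallen.org/comment_spam/blacklist.txt"  # adjust to the published location

def load_blacklist(url=BLACKLIST_URL):
    """Fetch the master blacklist: one regex pattern per line, #-comments."""
    patterns = []
    with urllib.request.urlopen(url) as f:
        for raw in f.read().decode("utf-8", "replace").splitlines():
            line = raw.split("#", 1)[0].strip()
            if not line:
                continue
            try:
                patterns.append(re.compile(line, re.IGNORECASE))
            except re.error:
                pass  # skip any line that isn't a valid regex
    return patterns

def is_spam_referrer(referrer, patterns):
    """True if the referrer URL matches any blacklist pattern."""
    return any(p.search(referrer) for p in patterns)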

The master blacklist isn’t perfect, however, and a quick check of the file against the referrers that got through on 13 January 2005 shows that few or none of them were listed. That’s why we should also consider Idea #2.

Another interesting development to note in this area is the Manila Referrer Spam Blacklist (MRSB). It seems to still be in the experimental stage at this point, but its XML-RPC approach is interesting. It would be fairly trivial to write plugins for popular blogging software allowing users to contribute spamming URLs to the MRSB database. The trick, I believe, would be in the vetting process. Right now I don’t see that one exists (or I just don’t understand it).

UPDATE 1/21/05: The idea is starting to catch fire. (Perhaps I originally posted this entry at one of those rare times when a few people are thinking about the same problem and arrive at the same type of solution via multiple paths.) In any case, Tony at juju.org has developed the derefspam.pl Perl script to filter log files against Jay Allen’s blacklist. In a similar vein, Rod at Groovy Mother wrote a patch for AWStats to do the same thing. Mark McLaughlin has followed suit for Shaun Inman’s Shortstat. Great work!

UPDATE 2/1/05: Peter Wood has extended this idea to mod_security, writing the Perl script blacklist_to_modsec to combine Jay Allen’s blacklist with web server-level spam control. This goes “beyond the blog,” baby. Nice.

Idea #2: Filter referrer IPs against spam blacklists

I recently implemented Brad Choate’s MT-DSBL, a plugin that checks a commenter’s IP against the blacklist maintained at DSBL (a service that keeps a list of open relays). I believe the general idea of combatting comment spam by harnessing the DSBL or DNS-based blackhole lists could also be used to ferret out referral spam.

I queried eight randomly selected referrer spamming IPs against OpenRBL.org. This website queries 28 blacklists and returns a Positive/Negative score, with a “positive” indicating that the IP is listed on the given blacklist.

#   IP address        Positive/Negative (of 28 lists)
1   66.237.84.20      4/24
2   213.172.36.62     0/28
3   61.9.0.99         1/27
4   68.47.42.60       7/21
5   203.162.3.77      1/27
6   193.188.105.16    2/26
7   213.56.68.29      7/21
8   200.242.249.70    7/21

The above table’s scores are as of evening, 13 January 2005 (CST). They may be different if you check an IP’s blacklist presence now.

My proposal for log file filtering of referrer spamming is rather simple:

  1. For a request with a referrer, query the IP against a blacklist. This might be DSBL or another list; I’m certainly not the best one to decide. (A code sketch of this step follows the list.)
  2. If the IP is blacklisted (or has a high score among a multitude of blacklists), refrain from listing that referring URL in any section of a site’s Web stats.
  3. Once a given site has been identified as a referral spam hostname (e.g. houseofsevengables.com, as mentioned above), do not bother querying the blacklist for any request carrying that hostname as its referrer. This is simply for efficiency’s sake.
  4. Once an IP has been identified as a referral spamming IP, do not bother querying the blacklist again. Again, efficiency’s sake.
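A minimal sketch of steps 1 and 4 in Python, using nothing but a DNS lookup. The zone name is my understanding of DSBL’s published query zone; substitute whichever list you settle on:

import socket

_seen = {}  # per points 3 and 4: remember verdicts so each IP is queried once

def is_blacklisted(ip, zone="list.dsbl.org"):
    if ip in _seen:
        return _seen[ip]
    # DNSBLs are queried by reversing the octets and appending the zone:
    # 216.204.237.7 becomes 7.237.204.216.list.dsbl.org.
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)  # any A record means "listed"
        listed = True
    except socket.gaierror:          # NXDOMAIN means "not listed"
        listed = False
    _seen[ip] = listed
    return listed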

UPDATE 1/16/05: Also, Chris Wage has written up a great set of directions for using the mod_access_rbl module in Apache to match IPs against DNSBLs. While this won’t catch 100% of referral spamming IPs (see the variation in scores above), it should cut down on the number that get through. You might wonder what sort of effect this method would have on site performance. Here’s Chris’ response:

Response time is affected, but not much for normal usage. The query responses are cached by [your] local nameserver on the same network, so the most someone would notice is a slight delay on the first load of the page.

Conclusion

Referral spam will not go away until bloggers make it a useless enterprise for spammers. Spammers are not stupid, and they will gradually stop the practice if they see that their efforts have no return.

In the meantime, I propose the above methods for filtering the referrer stats of websites. This is performed at logfile analysis time, not when the HTTP request is made. It seems to me that Ideas #1 and #2 could be combined, with #1 more efficient for client and server and #2 more likely to be up-to-date in real-time.

I welcome all comments. I am certainly no expert in these matters, but in searching the Web, I have found a lack of discussion in this area.

Addenda

It’s also been suggested that Web stat scripts could check the referrer’s website for a link back to one’s own site. If no link is found, the script would assume that the referrer is a bogus, spam URL. (A sketch of the check follows the list below.) I see two problems with this approach:

  • Blog indexes change quickly. What’s on the index page at 2:00 p.m. might be gone at 2:30 p.m. This can be because the blogger deletes the link to your site or because the index “rolls over” and displays only the past 10 entries.
  • Spammers could quickly adapt. They could simply link to every site they spam.
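For what it’s worth, the check itself would be only a few lines, which is exactly why spammers could adapt to it so quickly. A sketch in Python, with my domain standing in for yours:

import urllib.request

MY_SITE = "underscorebleach.net"  # substitute your own domain

def links_back(referrer_url):
    """Fetch the referring page and look for a link back to this site."""
    try:
        with urllib.request.urlopen(referrer_url, timeout=10) as f:
            page = f.read().decode("utf-8", "replace")
    except Exception:
        return False  # unreachable or broken referrers fail the test
    return MY_SITE in page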

If you do use the .htaccess method to combat referrer spam, I suggest wannaBrowser to test your rewrite rules. It’s the simplest way to see whether you’re properly blocking spam URLs. The htaccess blocking generator will help you write the rules. SixDifferentWays has a pretty complete post on battling the spam this way. Ed Costello’s article is also good.

Other resources