Sun Mar 11 22:00:39 EDT 2012

fun with spiders

So I grepped my logs for everything that asks for robots.txt, and the results were interesting.

Firstly, there are a lot of spiders out there!

$ grep "06/Mar" spiders.txt | wc -l
69

When an article isn't trending, or I haven't posted lately, bbot.org does about 1200 hits and 300 uniques a day. Some spiders request robots.txt multiple times, but cloaked spiders don't request robots.txt at all, so it's a wash. Taking 69 spiders against 300 uniques, that means fully a fifth of my traffic is from machines.
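
Since some spiders ask more than once, the line count above is a count of requests, not of distinct spiders. A rough sketch of counting distinct user agents for one day instead: the function name count_agents is made up for illustration, and it assumes spiders.txt holds lines in the combined log format shown in the excerpts in this post.

```shell
# Hypothetical helper: count distinct user agents seen on one day.
# Assumes combined log format, where splitting a line on double quotes
# puts the user agent in field 6.
count_agents() {
    day=$1    # e.g. "06/Mar"
    log=$2    # e.g. spiders.txt
    awk -F'"' -v day="$day" '
        index($1, day) && !seen[$6]++ { n++ }
        END { print n + 0 }
    ' "$log"
}
```

Called as count_agents "06/Mar" spiders.txt, this prints a single number. Spiders that rotate or fake their user agent are still miscounted, so it is an estimate at best.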

This is odd, because in terms of actual traffic, Google is the only game in town.

The traditional social contract between search engines and site owners is that you let them download your site, and in return, they drive traffic to you. This contract is broken, and has been for years. Google provides 99.98% of search traffic.

Microsoft, Yahoo, Gigablast and Blekko maintain fabulously comprehensive databases, updated regularly, which nobody ever uses. (DuckDuckGo strips referrer headers from its outgoing traffic, so it might be super popular; I have no way to know.) This doesn't even count the foreign-language search engines, like Baidu, Soso, Naver, Daum, or Yandex, which I also never ever see traffic from, since my site's in English.

Now, I don't use robots.txt, but still. What a waste of time and money!

Spiders fall roughly into three groups: Google, everybody else, and spammers. Brandwatch, Sitexploration, Seoprofiler, Linkdex, and Metadatalabs are all "SEO tools", which want to charge you money to get reports on their vague guess at how many people link to you. Spotinfluence even wants you to sign in via Facebook to begin using their site, and if you're dumb enough to do that, you deserve whatever they do to your profile.

Some spiders don't take no for an answer.

208.115.113.85 - - [01/Mar/2012:14:09:51 -0500] "GET /robots.txt HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)"
208.115.111.69 - - [01/Mar/2012:15:37:17 -0500] "GET /robots.txt HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)"
208.115.113.85 - - [01/Mar/2012:18:24:34 -0500] "GET /robots.txt HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)"
208.115.111.69 - - [01/Mar/2012:19:56:54 -0500] "GET /robots.txt HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)"
208.115.113.85 - - [01/Mar/2012:22:02:14 -0500] "GET /robots.txt HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)"

Some really don't.

107.21.161.122 - - [06/Mar/2012:09:46:53 -0500] "GET /robots.txt HTTP/1.0" 404 169 "-" "linkdex.com/v2.0"
107.21.161.122 - - [06/Mar/2012:09:46:53 -0500] "GET /robots.txt HTTP/1.0" 404 169 "-" "linkdex.com/v2.0"
184.73.70.151 - - [06/Mar/2012:14:31:17 -0500] "GET /robots.txt HTTP/1.0" 404 169 "-" "linkdex.com/v2.0"
184.73.70.151 - - [06/Mar/2012:14:31:17 -0500] "GET /robots.txt HTTP/1.0" 404 169 "-" "linkdex.com/v2.0"
107.20.100.1 - - [06/Mar/2012:22:27:30 -0500] "GET /robots.txt HTTP/1.0" 404 169 "-" "linkdex.com/v2.0"
107.20.100.1 - - [06/Mar/2012:22:27:31 -0500] "GET /robots.txt HTTP/1.0" 404 169 "-" "linkdex.com/v2.0"
107.20.100.1 - - [06/Mar/2012:22:54:24 -0500] "GET /robots.txt HTTP/1.0" 404 169 "-" "linkdex.com/v2.0"
107.20.100.1 - - [06/Mar/2012:22:54:25 -0500] "GET /robots.txt HTTP/1.0" 404 169 "-" "linkdex.com/v2.0"

I'm not sure what the rationale is for using such a short timeout, or for asking multiple times. Presumably if a robots.txt request returns a 404, and has been doing so for the last two years, one isn't going to appear five hours later. Google, on the other hand, checks only twice a day to see if I've changed my mind.
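
To put numbers on how impatient a given spider is, one option is to print the gap between its consecutive robots.txt requests. This is only a sketch: request_gaps is a made-up name, it assumes the combined log format from the excerpts above with lines already in time order, and it only compares requests made on the same day.

```shell
# Hypothetical helper: print the gap in seconds between consecutive
# requests from the same user agent on the same day.
# Assumes combined log format with lines already in time order.
request_gaps() {
    awk '
        {
            split($4, t, ":")       # t[1]="[06/Mar/2012", t[2..4]=HH, MM, SS
            split($0, q, "\"")      # q[6] is the user agent
            key = t[1] " " q[6]     # one running clock per day and agent
            secs = t[2] * 3600 + t[3] * 60 + t[4]
            if (key in last) print secs - last[key], q[6]
            last[key] = secs
        }
    ' "$1"
}
```

Run as request_gaps spiders.txt, each output line is a gap in seconds followed by the user agent. A gap of 0 or 1 is the doubled-up request pattern visible in the Linkdex lines.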

While I've got my logs open, I'd like to complain about this: (original domain replaced with .su)

91.229.175.130 - - [11/Mar/2012:19:52:39 -0400] "GET /clubwearguru.su/blog/clubwearguru.su/archives/clubwearguru.su/2009/clubwearguru.su/07/clubwearguru.su/index.htmlhxxp://bbot.org/clubwearguru.su/blog/clubwearguru.su/archives/clubwearguru.su/2009/clubwearguru.su/07/clubwearguru.su/index.html HTTP/1.0" 404 169 "-" "Mozilla/4.0 (compatible; ICS)"

This appears to request a valid page on my server, but with a domain inserted after every /. Does this actually work? Do people really automatically post their server logs online? The mind boggles.
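
Requests like this are easy to pick out of a log mechanically, since a normal path rarely repeats the same domain-looking segment. A minimal sketch, with a made-up function name (find_stuffed_paths) and an arbitrary threshold of three repeats, again assuming the combined log format:

```shell
# Hypothetical helper: print log lines whose request path repeats the same
# domain-looking segment three or more times (the threshold is a guess).
# Assumes combined log format, where field 7 is the request path.
find_stuffed_paths() {
    awk '
        {
            n = split($7, seg, "/")
            split("", count)        # portable way to clear an awk array
            for (i = 1; i <= n; i++) {
                if (seg[i] ~ /^[A-Za-z0-9-]+\.[A-Za-z]+$/ && ++count[seg[i]] == 3) {
                    print
                    break
                }
            }
        }
    ' "$1"
}
```

A file name like index.html also looks like a domain to this pattern, which is why the sketch waits for three repeats of the same segment before flagging a line.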


If you liked this, you should donate! bbot.org is out of money, and is not long for this world otherwise.