Mon Jan 17 06:05:47 PST 2011
more fun with wget
So I downloaded Everything2. Well, most of it. The first two million nodes. It's about 31 gigabytes uncompressed, and 1.4 gigabytes compressed, in case you needed an object lesson in how redundant HTML-formatted English text is.
You can download it here.
Why? How? Those are good questions.
I've been hanging out on #archiveteam for a while, and after you see the twentieth site shut down, you tend to notice trends. One of those trends: sites that aren't terribly hip, are fairly old, aren't actively profitable, and are owned by Yahoo tend to get axed. The highest-profile of Archive Team's projects, the Geocities rescue, is a prime example.
E2 is pretty old, dating from 1998, and occupies a special place in my heart, being the first place online I used the "bbot" name. It has a really enormous amount of content that's available nowhere else, is slowly losing users, and has no publicly available backups. I decided to fix that.
Downloading e2 initially looked, if not terribly challenging, at least somewhat tedious. Long experience says the best practice for downloading the contents of a large website is to spider links with one program and download them with another, which is fiddly and annoying. It also consumes excitingly non-trivial amounts of computer power, which was at a premium on the machine that was going to do the downloading.
This also has the obvious disadvantage that grabbing new pages as they're added by their human-readable titles requires re-spidering the entire site.
However, e2 has a legacy interface which allows you to access each page by its node-id, which is a simple integer.
Why, this nail rather resembles one that I have previously brutalized with a similar hammer.
I used the exact same script, changing only the loop conditional (i=1; i <= 2000000; i=i+1) and, of course, the target URL. Since I was in no particular rush, I left --random-wait enabled, which throttled wget to roughly one request a second. With two million nodes, this took 23 days to complete.
Humorously, two million files in one directory is enough to make bash completely flip its shit, which makes things like tab completion and wildcard expansion fail in new and amusing ways. Not being able to run ls, or do a simple "tar *", gives working with large directories a curiously remote feel, like the Linux equivalent of a glove box. After some comical blundering around with character escaping, I finally hammered that particular nail flat by reusing the script that generated the URLs to generate a file list instead, and fed that to tar using the --files-from option.
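The trick, roughly (a sketch, not the exact script; the filenames follow wget's default naming from the URL, and I've used a range of 10 here so it runs instantly, where the real run used 1 through 2,000,000):

```shell
#!/bin/sh
# Regenerate the exact filenames wget created, instead of asking the shell
# to glob two million directory entries. Range of 10 for illustration only.
for i in $(seq 1 10); do
  printf 'index.pl?node_id=%s\n' "$i"
done > filelist.txt

head -n 1 filelist.txt   # prints "index.pl?node_id=1"
# The real archive step then becomes:
#   tar -cf e2.tar --files-from=filelist.txt
```

Since the list is generated rather than globbed, bash never has to expand two million names at once.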
Running tar revealed that there were quite a few missing files where everything2.com had returned assorted 50X errors, or timed out: a total of ~7000 missing files. (A not terribly bad failure rate of 0.35%.) Since the point of an archive is to archive all the files, this was clearly unacceptable, yet I didn't really know how to grab those files again. I thought about parsing the gigantic log file wget produced for errors, and then with some regex magic turning that into a file list, but it was pretty clear that was going to be an awful kluge. So I applied some actual thought... and used the same script again.
#!/bin/bash
for (( i=1; i <= 2000000; i=i+1 ))
do
    if [[ ! -e "index.pl?node_id=$i" ]]
    then
        wget --user-agent "Mozilla/5.0 (compatible; Googlebot/2.1; (Not really, archiveteam.org))" "http://everything2.com/index.pl?node_id=$i"
    fi
done
This increments $i by 1 up to 2,000,000, then passes it to an if conditional, which uses -e to check whether "index.pl?node_id=$i" exists; the "!" negation inverts the test, so wget only runs if the file doesn't exist. Compared to several earlier stabs at the problem, this is impressively elegant, and blazingly fast.
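The same existence test also makes it cheap to see how far along a re-run is. A sketch of a gap counter built on the same check (a hypothetical helper, not from the original script; it uses a scratch directory and a range of 10 so it's deterministic and quick, versus 2,000,000 in the real run):

```shell
#!/bin/bash
# Count node IDs whose file hasn't been downloaded yet, using the same
# [[ ! -e ]] test as the retry loop. Scratch directory keeps it deterministic.
dir=$(mktemp -d)
touch "$dir/index.pl?node_id=3"   # pretend node 3 is already downloaded
missing=0
for (( i=1; i <= 10; i=i+1 )); do
  if [[ ! -e "$dir/index.pl?node_id=$i" ]]; then
    missing=$((missing + 1))
  fi
done
echo "$missing of 10 still missing"   # prints "9 of 10 still missing"
rm -rf "$dir"
```

Run after each pass, a count that stops shrinking is how you notice the permanently broken nodes.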
Running this script a couple of times revealed that there were 236 files that everything2.com would always return an error for, making the description of this post somewhat inaccurate. It's not two million nodes but 1,999,764. I've made the e2 dev team aware of the problem.
Now that you've seen what enormous amounts of trouble I've endured to produce the torrent, I'm sure you're feeling pretty guilty for not downloading it. Maybe you should. Just saying.