Fri Jun 19 18:11:39 UTC 2009

fun with wget

So one Matt Blind, author of an excellent piece on bookstore customers, also happens to produce bestseller lists for graphic novels. How? Well, in his own words:

The Core of the Charts is made up of data from three sites: Amazon, Barnes & Noble, and Borders.

Once a week, I visit each site to check their Graphic Novel categories, and I sort the search results by "bestselling". The links above will pull up exactly that.

I then click through, page after page, and type the titles into a spreadsheet in the order that they are ranked on the sales site. [this is the hard part]

And once I have a full list, I assign points to the books depending on how highly they rank. Add up the points each title earns (and add on similar data from a half-dozen second-tier sales sites) to get a composite score, and there's your ranking.

In concept, it's that simple.

In practice, because the sites themselves can update as often as once an hour, after I load up a website & sort the search results, I then click open each new page in a new tab until I have 20-100 tabs open, representing a snapshot of the full sales (top 900-1200 titles) of this particular sales site over a relatively short time-frame (10-15 minutes). And then I start the data entry.

Wow, that sucks. What did Raymond say about data entry?

Hackers (and creative people in general) should never be bored or have to drudge at stupid repetitive work, because when this happens it means they aren't doing what only they can do -- solve new problems. This wastefulness hurts everybody. Therefore boredom and drudgery are not just unpleasant but actually evil.

You're right, Raymond! This sorely needs some automation.

But complete automation would be Nontrivial, so let's take the merest of first steps. Opening hundreds of tabs sucks. How about just one tab?

First thing, I wrote a little bash script to generate the links list, since it would, of course, be 101 lines long, and I didn't feel like copying and pasting each one from Firefox, or whatever.

First, let's look at the base url.

Clicking on "next" results in:

The string of interest is "No=20", which increments by ten every page. So the resulting script is:

for (( i=30; i <= 1000; i=i+10 )); do
    # the base url, up to and including "No=", goes in front of $i
    echo "$i&N=0+989443&Ne=989443&visgrp=fiction&act=BC_ANC"
done

This script creates a variable (i), sets it to 30 (i=30), tells it to run the loop as long as i is less than or equal to 1000 (i <= 1000), then increments i by 10 each loop (i=i+10). Each pass through the loop inserts the current value of i in the base url. It starts at 30 because I already had the first three pages of sales ranks.
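A slightly fuller version of that loop might look like the sketch below. BASE_URL is a hypothetical placeholder standing in for the real search url up to and including "No=", and the make_links function name is invented for illustration; only the query-string tail comes from the script above.

```shell
#!/usr/bin/env bash
# Hypothetical placeholder: substitute the real search url up to and including "No=".
BASE_URL="http://example.com/search?No="
# Query-string tail copied from the loop above.
SUFFIX="&N=0+989443&Ne=989443&visgrp=fiction&act=BC_ANC"

# make_links START END STEP: print one url per results page.
make_links() {
    local start=$1 end=$2 step=$3 i
    for (( i = start; i <= end; i += step )); do
        printf '%s%s%s\n' "$BASE_URL" "$i" "$SUFFIX"
    done
}

# Same range as the loop above, written straight to the file wget will read.
make_links 30 1000 10 > links.txt
```

Keeping the range as arguments makes it easy to regenerate the list if the site changes its page size or you already have more pages in hand.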

Running this script and piping the output to "links.txt" results in a 101-line-long text file, with a url on each line. We tell wget to use this file, to wait a random time between hitting each link (strictly, --random-wait only randomizes the delay set with --wait, so on its own it adds little or no pause), and to use the user-agent "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618)", which happened to belong to the most recent visitor when I plundered the log for user agents. We also tell it to log to wget.log, and to be verbose, because --verbose is awesome.

wget --random-wait --user-agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618)" --input-file=links.txt --output-file=wget.log --verbose
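For a pause that really does fall in a random range, --random-wait is normally paired with --wait. The sketch below is a hedged variant of the command above, not what was run for the timings that follow: the one-second base delay is an assumption, and the RUN switch and the wget_args name are invented here so the command can be inspected before it hits the site.

```shell
#!/usr/bin/env bash
# User-agent string copied from the command above.
UA="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618)"

# Arguments live in an array so the long user-agent stays a single argument.
# --wait=1 is an assumed base delay in seconds; it is not in the original command.
# --random-wait then varies the actual pause around that base value.
wget_args=(
    --wait=1
    --random-wait
    --user-agent="$UA"
    --input-file=links.txt
    --output-file=wget.log
    --verbose
)

# Print the command by default; set RUN=1 to actually download.
if [ "${RUN:-0}" = 1 ]; then
    wget "${wget_args[@]}"
else
    printf '%q ' wget "${wget_args[@]}"
    echo
fi
```

With a base delay in place, a hundred-odd pages will take minutes rather than seconds, which is the price of being polite to the server.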

Wget then hogs the terminal, but remains silent, because we told it to output to wget.log. We can watch along in another terminal by tail -f'ing wget.log, and occasionally ls'ing. It goes pretty fast, taking 20.52 seconds to download 7.4 mebibytes worth of html. After cat'ing everything together you end up with a mighty browser-crushing 7.4 mebibyte html file that makes Opera colossally unhappy, but which Firefox can handle fine-ish. It also takes forever to upload, so we bzip it to a svelte 220 kibibytes.
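The cat-and-bzip step might look something like the sketch below. The names are assumptions: wget names each download after its url, so the directory argument, the output file name, and the combine_pages function are hypothetical stand-ins, and the downloaded pages are assumed to sit in a directory of their own.

```shell
#!/usr/bin/env bash
# combine_pages DIR OUT: concatenate every file in DIR into OUT, then bzip2 it.
combine_pages() {
    local dir=$1 out=$2
    # The glob expands in lexicographic order, which may not match rank order.
    cat "$dir"/* > "$out"
    # -k keeps the uncompressed file, -f overwrites an old .bz2, -9 uses the largest block size.
    bzip2 -k -f -9 "$out"
    # Show both sizes side by side.
    ls -l "$out" "$out.bz2"
}
```

Something like "combine_pages pages all.html" would then leave all.html for the browser and all.html.bz2 for uploading.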

Standard large html file disclaimers apply. It's big big big, media rich, and outrageously malformed.

File under: Linux