Fri Nov 11 08:10:36 EST 2011
guerilla archiving I
Remember when I downloaded everything2.com and a dozen people sent me screamingly angry emails, and the whole thing was generally stressful and unrewarding?
Well shit, let's do that again, but with a different site. This time, though, I sent them an email first:
Hi, I'm Sam Bierwagen, a volunteer with the Archiveteam project. (http://archiveteam.org/) We make independent backups of sites of historical or cultural interest that, for whatever reason (they are being shut down by Yahoo, like Geocities, or crippled by the host company, like Delicious), are at risk of disappearing. AO3 is dedicated to hosting copyright-infringing content, and depends on donations to keep operating; a combination that, in my experience, does not result in spectacular longevity.
Typically, we operate under extreme time pressure, which requires tactics that tend to generate some friction with operators that don't appreciate a dozen pageloads per second from our web spiders. Even extremely conservative spidering jobs can impact a site negatively, if done via an unusual API. (I downloaded all two million pages of everything2.com at the pace of one per second, which averaged out to three kilobytes per second, and took a full month; yet apparently was enough to crush their antiquated database backend.)
I've had enough legal threats to last me a lifetime, so I'm trying a softer approach this time. We're looking for database dumps, like the ones Wikipedia publicly offers. (http://dumps.wikimedia.org/)
I'll give them a week.