DI2
How To Crawl

We would like to thank the NHPRC for its support of the Documenting Internet2 project.

When looking for ways to capture documentation of I2 as an organization, we soon determined that a web crawl would provide one very helpful pool of information. There are at least four basic steps to the task of archiving via a crawl: crawling, processing, mirroring, and preserving.

The process described below uses Heritrix (a Java-based crawler), Perl, JavaScript, PHP (along with a web server that can serve the PHP), and MySQL. Any system that accommodates this suite of software should be capable of duplicating this work.

Credit

Questions about the process we have used to archive the I2 web should be addressed to Eric Celeste at efc@umn.edu.

First, credit where it is due: The University of Michigan School of Information pioneered much of this work with their project to mirror the UMich web site. Nick Baker, in particular, did some terrific work that serves as the foundation for our work archiving the I2 site. Nick's documentation of the UMich process is the basis of this document and his scripts are the core of the scripts we present here. Thanks, Nick!

Crawling

We chose to use the Heritrix crawler from the Internet Archive (IA). While Heritrix is still a young crawler and not yet suited to wide crawls of the sprawling net, it is well enough developed for focussed crawls like that of the I2 domains. We also appreciate that Heritrix is a Java application, and thus deployable across many platforms (including the Mac OS X platform we used), and that it is an open project aimed at broad adoption. The fact that Heritrix is being developed by the IA, creators of the Wayback Machine, also put us in a better position to work with the IA to get backfiles and share our archives down the road.

I2 provided us with a list of domains to use as seeds for Heritrix. We crawled I2 in April 2005 and again in July 2005.

Processing

The crawl results in a stack of ".arc" files that then need to be processed. ARC files bundle everything the crawl finds into a set of very large files. HTML, images, sound files, movies, error messages, all of it gets stuffed into the ARC files. Heritrix can optionally compress the ARC files, but after some compression errors we just let them be. Nick put an example ARC file on his site (most ARC files are much larger than this one, though).
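As a rough illustration of what "bundled" means here, the following Python sketch walks the records of an uncompressed ARC file. It is not one of the project's Perl scripts. It assumes the version 1 ARC layout, in which each record starts with a one-line header of five space-separated fields (URL, IP address, 14-digit date, content type, length in bytes) followed by exactly that many bytes of content. The names read_arc_records and make_record, and the sample data, are made up for the example.

```python
import io


def read_arc_records(stream):
    """Yield (header, content) pairs from an uncompressed ARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank line separating two records
        # Assumed header layout: URL, IP, date, content type, length.
        # Split from the right so a URL containing spaces stays in one piece.
        url, ip, date, content_type, length = (
            line.decode("latin-1").rstrip("\r\n").rsplit(" ", 4)
        )
        content = stream.read(int(length))
        header = {"url": url, "ip": ip, "date": date,
                  "content_type": content_type, "length": int(length)}
        yield header, content


def make_record(url, content_type, content):
    """Build one ARC-style record for the demonstration below (hypothetical helper)."""
    head = f"{url} 192.0.2.1 20050701120000 {content_type} {len(content)}\n"
    return head.encode("latin-1") + content + b"\n"


sample = (
    make_record("http://example.org/", "text/html", b"<html>hello</html>")
    + make_record("http://example.org/logo.png", "image/png", b"\x89PNG fake bytes")
)

for header, content in read_arc_records(io.BytesIO(sample)):
    print(header["url"], header["content_type"], len(content))
```

Because each record is read by its declared length rather than by searching for a delimiter, binary content such as images passes through untouched.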

arc_extractor

To spin an ARC file out into its constituent files, you can use the arc_extractor.pl script. This script creates files that have some slight changes in them from the originals embedded in the ARCs. For example, any HTML file spun out gets references to a few archivetools scripts. These help anyone browsing the raw files navigate cleanly among files generated by the arc_extractor script and jump over to the online mirror created by the scripts below.

Note that arc_extractor is not required for mirroring the content online. Conversely, none of the scripts below is required if all you want to do is use the extracted content on a local filesystem. Using these extracted files is a whole different art.

arc_optimizer

Multiple crawls of the same site may have large amounts of duplication. Therefore, it is much more efficient to store only the changed content. The arc_optimizer.pl script reads through the ARC files and retrieves new and changed content, creating files with the ".arco" extension, meaning that they are optimized ARC files. While we keep copies of all the ARC files generated offline, we store the smaller ARCO files online for mirroring purposes.
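The idea of keeping only new and changed content can be sketched in a few lines of Python. This is not the logic of arc_optimizer.pl, which we do not reproduce here; comparing a SHA-1 digest of each document against the digest last seen for the same URI is an assumption made for the example, as is the function name select_new_or_changed.

```python
import hashlib


def select_new_or_changed(records, seen):
    """Return the (uri, content) pairs whose content differs from what was seen.

    `seen` maps uri -> digest of the last kept version and is updated in place,
    so it can be carried from one crawl to the next.
    """
    kept = []
    for uri, content in records:
        digest = hashlib.sha1(content).hexdigest()
        if seen.get(uri) != digest:   # new URI, or same URI with new content
            kept.append((uri, content))
            seen[uri] = digest
    return kept


# Two made-up crawls of the same tiny site.
april = [("http://example.org/", b"v1"),
         ("http://example.org/about", b"same")]
july = [("http://example.org/", b"v2"),
        ("http://example.org/about", b"same"),
        ("http://example.org/new", b"new")]

seen = {}
kept_april = select_new_or_changed(april, seen)
kept_july = select_new_or_changed(july, seen)
print(len(kept_april), [uri for uri, _ in kept_july])
```

Stored ARC records may carry response headers that vary from crawl to crawl, so a real comparison would probably need to look at the document payload only.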

arco_indexer

The arco_indexer.pl script creates an external tab-delimited index of the ARCO files, including such metadata as uri, datetime, and the position of the file within the ARCO file. It could easily be modified to index the ARC files themselves if so desired.
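A minimal Python sketch of such an index follows. It is an illustration rather than a port of arco_indexer.pl. It assumes that ARCO files use the same record layout as version 1 ARC files (a one-line header of URL, IP address, date, content type, and length, then the content), that the recorded position is the offset of the first content byte, and that the columns come out in the order uri, date, content type, start, length, file name. The name index_arc is hypothetical.

```python
import io


def index_arc(stream, arc_name):
    """Return tab-delimited index lines for an uncompressed ARC-style stream."""
    lines = []
    while True:
        head = stream.readline()
        if not head:
            break  # end of file
        if not head.strip():
            continue  # blank separator between records
        uri, _ip, date, content_type, length = (
            head.decode("latin-1").rstrip("\r\n").rsplit(" ", 4)
        )
        start = stream.tell()                  # offset of the first content byte
        stream.seek(int(length), io.SEEK_CUR)  # skip over the content itself
        lines.append("\t".join(
            [uri, date, content_type, str(start), length, arc_name]))
    return lines


data = (
    b"http://example.org/ 192.0.2.1 20050701120000 text/html 5\nhello\n"
    b"http://example.org/b 192.0.2.1 20050701120001 text/plain 3\nbye\n"
)
index_lines = index_arc(io.BytesIO(data), "sample.arco")
for line in index_lines:
    print(line)
```

With the start and length recorded, any document can later be pulled out of the archive file with a single seek and read.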

arco_sql

The arco_sql.pl script transforms the tab-delimited files from the prior step into a set of SQL commands suitable for importing into a MySQL database. You can run these commands in MySQL with a source command. For example, source /path/to/the/file.sql will do the trick.
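The transformation can be pictured with this short Python sketch, which turns tab-delimited index lines into INSERT statements. It is not arco_sql.pl itself. The six-column order (uri, date, content, start, length, arc), the table name july2005, and the simple escaping of backslashes and single quotes are all assumptions made for the example.

```python
def sql_quote(value):
    """Wrap a string in single quotes, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def tsv_to_inserts(lines, table):
    """Yield one INSERT statement per tab-delimited index line."""
    columns = ["uri", "date", "content", "start", "length", "arc"]  # assumed order
    column_list = ", ".join(f"`{name}`" for name in columns)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        values = ", ".join(sql_quote(field) for field in fields)
        yield f"INSERT INTO `{table}` ({column_list}) VALUES ({values});"


index_lines = [
    "http://example.org/o'hara\t20050701120000\ttext/html\t58\t5\tsample.arco\n",
]
statements = list(tsv_to_inserts(index_lines, "july2005"))
for statement in statements:
    print(statement)
```

Writing the statements to a file gives something that the MySQL source command can load directly.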

whole.sh

These scripts all run in sequence, and all take prescribed input and produce predictable results. To make it simpler to sew the whole process together (and to document the usage of these scripts), we use another shell script, whole.sh. You can supply this script with the path to a directory that contains an "arcs" directory (one of the job directories created by Heritrix, for example) and it will work through all these steps for that set of ARC files. Note that it assumes the directory name you supply is the name you would like to use for the table in MySQL that will index this crawl's content.
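The two conventions described here (an "arcs" subdirectory, and the directory name doubling as the table name) can be sketched in Python as follows. This is only an illustration of the conventions, not a replacement for whole.sh, and it does not call the Perl scripts because their command-line arguments are not documented in this text. The function name plan_job and the job name july2005 are hypothetical.

```python
import os
import tempfile


def plan_job(job_dir):
    """Return (table_name, sorted ARC paths) for a Heritrix-style job directory."""
    job_dir = os.path.abspath(job_dir)      # also drops any trailing slash
    table_name = os.path.basename(job_dir)  # directory name doubles as table name
    arcs_dir = os.path.join(job_dir, "arcs")
    arc_paths = sorted(
        os.path.join(arcs_dir, name)
        for name in os.listdir(arcs_dir)
        if name.endswith(".arc")
    )
    return table_name, arc_paths


# Demonstration with a throwaway job directory.
with tempfile.TemporaryDirectory() as tmp:
    job = os.path.join(tmp, "july2005")
    os.makedirs(os.path.join(job, "arcs"))
    for name in ("b.arc", "a.arc", "notes.txt"):
        open(os.path.join(job, "arcs", name), "w").close()
    table_name, arc_paths = plan_job(job + os.sep)
    arc_names = [os.path.basename(path) for path in arc_paths]
    print(table_name, arc_names)
```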

Mirroring

Providing web access to the data gathered by the crawl is called "mirroring" the site. In this case, we can mirror much of the I2 web presence using the data we gathered in our crawl of Internet2 sites. Our mirror is available at http://thomas.lib.umn.edu/di2/ and demonstrates how flexible this system can be. Note that the mirrored content stays in the ARCO files; it does not need to be extracted. The database index built in this step provides pointers into those ARCO files so that the PHP script can dig out content from the ARCO files on demand.

MySQL

The July crawl collected about 120,000 documents (8GB of data). Each document needs an entry in the database to make lookups efficient for the mirror. The SQL script provided by arco_sql above can be loaded into MySQL with a "source" command to produce the appropriate table. The table structure will be similar to this...

 +----------+-------------+------+-----+---------+----------------+
 | Field    | Type        | Null | Key | Default | Extra          |
 +----------+-------------+------+-----+---------+----------------+
 | id       | int(16)     |      | PRI | NULL    | auto_increment |
 | uri      | text        | YES  |     | NULL    |                |
 | date     | varchar(14) | YES  |     | NULL    |                |
 | content  | varchar(32) | YES  |     | NULL    |                |
 | start    | varchar(32) | YES  |     | NULL    |                |
 | length   | varchar(32) | YES  |     | NULL    |                |
 | arc      | varchar(64) | YES  |     | NULL    |                |
 | response | char(3)     | YES  |     | NULL    |                |
 +----------+-------------+------+-----+---------+----------------+

More details about the database are in the UMich documentation.

archive.php

To a user, navigating the mirror is much like navigating the real site: the user clicks on links and "moves" around the site. Under the covers, something quite different is going on. First the archive.php script connects to the database to find whether the mirror holds the desired content and, if so, to learn which ARCO file holds it and where in that file it lives.

For non-text files, such as JPEG images, the script reads the binary information from the ARCO file, prepends a header with the proper "Content-type" from the table, and returns the information to the web browser. Here is a JPEG image from the archive.
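The lookup-then-read step can be sketched in Python as below. This is not archive.php. An in-memory SQLite table stands in for the MySQL table, the table name crawl is hypothetical, and choosing the newest capture when a URI appears more than once is an assumption. The column names follow the table shown above.

```python
import os
import sqlite3
import tempfile


def fetch_from_archive(db, arc_dir, uri):
    """Return (content_type, data) for the newest capture of uri, or None."""
    row = db.execute(
        "SELECT content, start, length, arc FROM crawl "
        "WHERE uri = ? ORDER BY date DESC LIMIT 1",
        (uri,),
    ).fetchone()
    if row is None:
        return None  # the mirror does not hold this URI
    content_type, start, length, arc = row
    with open(os.path.join(arc_dir, arc), "rb") as fh:
        fh.seek(int(start))          # jump to the stored offset
        data = fh.read(int(length))  # read only this document's bytes
    return content_type, data


# Demonstration with a throwaway archive file and index table.
prefix, payload = b"earlier records...\n", b"JPEGDATA"
with tempfile.TemporaryDirectory() as arc_dir:
    with open(os.path.join(arc_dir, "sample.arco"), "wb") as fh:
        fh.write(prefix + payload + b"\nlater records...")
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE crawl (uri TEXT, date TEXT, content TEXT, "
               "start TEXT, length TEXT, arc TEXT)")
    db.execute("INSERT INTO crawl VALUES (?, ?, ?, ?, ?, ?)",
               ("http://example.org/a.jpg", "20050701120000", "image/jpeg",
                str(len(prefix)), str(len(payload)), "sample.arco"))
    found = fetch_from_archive(db, arc_dir, "http://example.org/a.jpg")
    missing = fetch_from_archive(db, arc_dir, "http://example.org/nope")
    db.close()
print(found, missing)
```

A web script would send the returned content type as a response header and then write the bytes to the browser unchanged.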

For HTML, the script makes some modifications and additions to the page. Here is an HTML page from the archive. Notice the green "DI2 Archive" tab in the upper right corner.

This is similar to the way the Internet Archive's Wayback Machine works, but there is more functionality within the page. If everything works correctly, this should allow users to surf through the archived site as though it were still live. JavaScript and Flash can wreak havoc with the system, but well designed and accessible sites shouldn't have these problems.

The key to flexibility in this mirror is that we store the content as we found it. No changes are made to the ARCO files directly. Instead we use the script to "rewrite" the content from the ARCO as it is presented to the browser. Three important chunks of JavaScript code are inserted into every HTML file presented by the script: a header with a pointer to our stylesheet (to make the "DI2 Archive" button), a "layer" section that pops out if you click the button, and some JavaScript to rewrite URLs in the document so that they point back to the mirror rather than off to the real site. Any of this inserted code can be changed at any time and the change will modify the behavior of the whole mirror right away.
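To make the URL rewriting concrete, here is a small Python sketch of the mapping from an original link to a mirror link. The mirror itself does this in the browser with JavaScript; this sketch performs the same kind of substitution on the HTML text instead, so it illustrates the idea and not our actual code. The mirror address and its uri parameter are hypothetical, as is the function name rewrite_links.

```python
import re
from urllib.parse import quote, urljoin

MIRROR_PREFIX = "http://mirror.example/archive.php?uri="  # hypothetical address

# Matches href="..." or src='...' and remembers which quote character was used.
LINK_ATTR = re.compile(r"""(\b(?:href|src)\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE)


def rewrite_links(html, page_url):
    """Point href/src attributes in html at the mirror instead of the live site."""
    def to_mirror(match):
        absolute = urljoin(page_url, match.group(3))  # resolve relative links
        if not absolute.startswith(("http://", "https://")):
            return match.group(0)  # leave mailto:, javascript:, etc. alone
        quote_char = match.group(2)
        return (match.group(1) + quote_char + MIRROR_PREFIX
                + quote(absolute, safe="") + quote_char)
    return LINK_ATTR.sub(to_mirror, html)


page = '<a href="/about/">About</a> <a href="mailto:info@example.org">Mail</a>'
rewritten = rewrite_links(page, "http://www.example.org/index.html")
print(rewritten)
```

Because the rewriting happens as the page is served and never touches the stored bytes, the mapping can be changed later without altering the archive files.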

Preserving

The ARC and ARCO files hold the content exactly as it was found. None of this source information is tampered with in any way. These archive files are also rather large. The 120,000+ documents we gathered from I2 resulted in 79 ARC files, most over 120MB apiece. For comparison, our scanned poster images often take over 300MB apiece to store. Managing the upkeep of 79 large files is much simpler than managing the upkeep of 120,000 small files. We expect to use the techniques we've established for the preservation of image files to manage the storage of these ARC and ARCO files.

Of course, lurking below the surface (or inside the ARCs) is the continuing problem of how to maintain access to the file formats that were caught in the crawl. Our July crawl summary reveals thousands of PowerPoint, PNG graphics, XML, and PDF documents, hundreds of MS Word, RealAudio, and ZIP files, and dozens of files in other movie formats: about two dozen file formats in all. We have no magic solution to this thorny issue; we just note here that we are doing "bit preservation" without any real idea how to maintain access to these formats over time.

Other Issues

We've noted...

  • The crawl does not provide any notion of when a file was created or last modified. All we know is when it was crawled. This is a notable shortcoming for future researchers.
  • The crawl requires significant partnership between the crawling agency (us) and the source of the data (the site). This is especially critical if the crawl is to disobey the regular robot exclusion directives on a site.
  • The JavaScript URL rewriting method, while really cool and very functional for simple web pages, can fail miserably when pages use scripts themselves to construct links on the fly. I'm sure AJAX and other new techniques will completely undermine JavaScript URL rewriting in the not too distant future. Even Heritrix itself can fail to crawl some of these sites properly. http://thomas.lib.umn.edu itself is an interesting example of a page that fails to be crawled properly. The "weblog" link in the right sidebar is missed entirely, since it is scripted.

We'd still like to do...

  • A crawl of I2 that overcomes some of the robot directives.
  • Adjusting our scripts and practice so that multiple crawls of the same site could be searched at once.