We would like to thank the NHPRC for its support of the Documenting Internet2 project.
When looking for ways to capture documentation of I2 as an organization, we soon determined that a web crawl would provide one very helpful pool of information. There are at least four basic steps to the task of archiving via a crawl: crawling, processing, mirroring, and preserving.
Questions about the process we have used to archive the I2 web should be addressed to Eric Celeste at firstname.lastname@example.org.
First, credit where it is due: The University of Michigan School of Information pioneered much of this work with their project to mirror the UMich web site. Nick Baker, in particular, did some terrific work that serves as the foundation for our work archiving the I2 site. Nick's documentation of the UMich process is the basis of this document and his scripts are the core of the scripts we present here. Thanks, Nick!
We chose to use the Heritrix crawler from the Internet Archive. While Heritrix is still a young crawler and not yet suited to wide crawls of the sprawling net, it is well enough developed for focused crawls like that of the I2 domains. We also appreciate that Heritrix is a Java application, and thus deployable across many platforms (including the Mac OS X platform we used), and that it is an open project aimed at broad adoption. The fact that Heritrix is being developed by the IA, creators of the Wayback Machine, also puts us in a better position to work with the IA to get backfiles and share our archives down the road.
The crawl results in a stack of ".arc" files that then need to be processed. ARC files bundle everything the crawl finds into a set of very large files: HTML, images, sound files, movies, error messages, all of it gets stuffed into the ARC files. Heritrix can optionally compress the ARC files, but after encountering some compression errors we left ours uncompressed. Nick put an example ARC file on his site (most ARC files are much larger than this one, though).
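For the curious, the ARC format is simple enough to read with a few lines of Perl. Here is a minimal sketch, assuming uncompressed version-1 ARC files like ours: each record begins with a single header line of five space-separated fields (uri, IP address, 14-digit datetime, content-type, and content length in bytes), followed by that many bytes of content.

    #!/usr/bin/perl
    # Walk the records of an uncompressed version-1 ARC file. The first
    # record (a filedesc:// uri) describes the ARC file itself.
    use strict;
    use warnings;

    my $arc = shift or die "usage: $0 file.arc\n";
    open my $fh, '<:raw', $arc or die "cannot open $arc: $!\n";

    while (defined(my $header = <$fh>)) {
        next if $header =~ /^\s*$/;               # skip record separators
        chomp $header;
        my ($uri, $ip, $datetime, $type, $length) = split / /, $header, 5;
        read($fh, my $content, $length) == $length
            or die "truncated record for $uri\n";
        print "$datetime $type $length $uri\n";   # the bytes are in $content
    }
    close $fh;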
To spin an ARC file out into its constituent files, you can use the arc_extractor.pl script. This script creates files that differ slightly from the originals embedded in the ARCs. For example, any HTML file spun out gets references to a few archivetools scripts, which help anyone browsing the raw files navigate cleanly among the files generated by arc_extractor and jump over to the online mirror created by the scripts below.
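We won't reproduce that rewrite here, but hypothetically it amounts to something like this substitution (the script path and filename are invented for illustration):

    use strict;
    use warnings;

    # Inject a reference to an archivetools script (hypothetical path)
    # just before the closing head tag of a spun-out HTML page.
    my $html = do { local $/; <> };           # slurp the whole page
    $html =~ s{</head>}{<script src="/archivetools/nav.js"></script></head>}i;
    print $html;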
Note that arc_extractor is not required for mirroring the content online. Conversely, none of the scripts below are required if all you want to do is use the extracted content on a local filesystem. Using these extracted files is a whole different art.
Multiple crawls of the same site may contain large amounts of duplication, so it is much more efficient to store only the changed content. The arc_optimizer.pl script reads through the ARC files and retains only new and changed content, creating files with the ".arco" extension, meaning that they are optimized ARC files. While we keep offline copies of all the ARC files generated, we store the smaller ARCO files online for mirroring purposes.
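We won't reproduce arc_optimizer.pl here either, but the heart of the idea can be sketched. In this hypothetical rendering, the record-reading and record-writing helpers are assumed rather than real, and content is compared by MD5 digest:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # %seen maps a uri to the digest of its most recently stored content;
    # a record is copied to the .arco output only when its content is new
    # or has changed since the last crawl.
    my %seen;
    while (my $rec = next_arc_record()) {       # next_arc_record() assumed;
        my $digest = md5_hex($rec->{content});  # e.g. the loop shown earlier
        next if ($seen{$rec->{uri}} // '') eq $digest;   # unchanged: skip
        $seen{$rec->{uri}} = $digest;
        write_arco_record($rec);                # write_arco_record() assumed
    }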
The arco_indexer.pl script creates an external tab-delimited index of the ARCO files, including such metadata as uri, datetime, and the position of each document within its ARCO file. It could easily be modified to index the ARC files themselves if so desired.
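Here is a hypothetical sketch of that indexing pass over a single ARCO file. The column order mirrors the MySQL table shown below; whether "start" points at the raw record or past the embedded HTTP headers is an assumption on our part:

    use strict;
    use warnings;

    my $arco = shift or die "usage: $0 file.arco\n";
    open my $fh, '<:raw', $arco or die "cannot open $arco: $!\n";

    while (defined(my $header = <$fh>)) {
        next if $header =~ /^\s*$/;                 # record separator
        chomp $header;
        my ($uri, $ip, $datetime, $type, $length) = split / /, $header, 5;
        my $start = tell $fh;                       # content begins here
        read $fh, my $content, $length;
        my ($response) = $content =~ m{^HTTP/\S+\s+(\d{3})};  # status code
        print join("\t", $uri, $datetime, $type, $start, $length,
                   $arco, $response // ''), "\n";
    }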
The arco_sql.pl script transforms the tab-delimited files from the prior step into a set of SQL commands suitable for importing into a MySQL database. You can run these commands in MySQL with a "source" command; for example, source /path/to/the/file.sql will do the trick.
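The transformation itself is easy to picture. Here is a hedged sketch, assuming the seven-column index layout above and the table structure shown below (the command-line convention is invented for illustration):

    use strict;
    use warnings;

    # Turn tab-delimited index lines into INSERT statements; the table
    # name comes from the command line, the column names from the MySQL
    # table structure shown below.
    my $table = shift or die "usage: $0 tablename < index.txt > index.sql\n";
    while (my $line = <STDIN>) {
        chomp $line;
        my @fields = split /\t/, $line, 7;
        s/'/''/g for @fields;                 # escape single quotes for SQL
        printf "INSERT INTO %s (uri, date, content, start, length, arc, response)"
             . " VALUES ('%s');\n", $table, join("', '", @fields);
    }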
These scripts all run in sequence, and each takes prescribed input and produces predictable results. To make it simpler to stitch the whole process together (and to document the usage of these scripts), we use another script, the whole.sh shell script. You can supply this script with the path to a directory that contains an "arcs" directory (one of the job directories created by Heritrix, for example) and it will work through all these steps for that set of ARC files. Note that it assumes the directory name you supply is the name you would like to use for the MySQL table that will index this crawl's content.
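whole.sh itself is a shell script; for consistency with the other examples, here is the driving logic sketched in Perl. The script names are real, but the exact arguments each one takes are assumptions carried over from the sketches above:

    use strict;
    use warnings;
    use File::Basename qw(basename);

    # The supplied directory must contain an "arcs" subdirectory; its
    # basename becomes the name of the MySQL table for this crawl.
    my $dir   = shift or die "usage: $0 /path/to/jobdir\n";
    my $table = basename($dir);
    -d "$dir/arcs" or die "$dir has no arcs directory\n";

    for my $step (
        "./arc_optimizer.pl $dir/arcs",                           # ARC -> ARCO
        "./arco_indexer.pl $dir/arcs > $dir/index.txt",           # ARCO -> index
        "./arco_sql.pl $table < $dir/index.txt > $dir/index.sql", # index -> SQL
    ) {
        system($step) == 0 or die "step failed: $step\n";
    }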
Providing web access to the data gathered by the crawl is called "mirroring" the site. In this case, we can mirror much of the I2 web presence using the data we gathered in our crawl of Internet2 sites. Our mirror is available at http://thomas.lib.umn.edu/di2/ and demonstrates how flexible this system can be. Note that the mirrored content stays in the ARCO files; it does not need to be extracted. The database index built in this step provides pointers into those ARCO files so that the PHP script can dig out content from the ARCO files on demand.
The July crawl collected about 120,000 documents (8GB of data). Each document needs an entry in the database to make lookups efficient for the mirror. The SQL script provided by arco_sql above can be loaded into MySQL with a "source" command to produce the appropriate table. The table structure will be similar to this...
+----------+-------------+------+-----+---------+----------------+
| Field    | Type        | Null | Key | Default | Extra          |
+----------+-------------+------+-----+---------+----------------+
| id       | int(16)     |      | PRI | NULL    | auto_increment |
| uri      | text        | YES  |     | NULL    |                |
| date     | varchar(14) | YES  |     | NULL    |                |
| content  | varchar(32) | YES  |     | NULL    |                |
| start    | varchar(32) | YES  |     | NULL    |                |
| length   | varchar(32) | YES  |     | NULL    |                |
| arc      | varchar(64) | YES  |     | NULL    |                |
| response | char(3)     | YES  |     | NULL    |                |
+----------+-------------+------+-----+---------+----------------+
More details about the database are in the UMich documentation.
To a user, navigating the mirror is much like navigating the real site: the user clicks on links and "moves" around the site. Under the covers, something quite different is going on. First, the archive.php script connects to the database to find whether the mirror holds the desired content and, if so, to learn where in which ARCO file the content lives.
For non-text files, such as JPEG images, the script reads the binary information from the ARCO file, prepends a header with the proper "Content-type" from the table, and returns the information to the web browser. Here is a JPEG image from the archive.
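archive.php is, of course, PHP; for consistency with the other examples, here is a Perl rendering of the same lookup-and-serve logic. The connection parameters, the table name ("july"), the arcos directory, and the CGI framing are all assumptions:

    use strict;
    use warnings;
    use CGI qw(param);
    use DBI;

    # Find the record for the requested uri, seek into its ARCO file, and
    # return the stored bytes under the recorded Content-type.
    my $uri = param('uri') or die "no uri requested\n";
    my $dbh = DBI->connect('dbi:mysql:di2', 'user', 'pass',
                           { RaiseError => 1 });

    my ($arc, $start, $length, $type) = $dbh->selectrow_array(
        'SELECT arc, start, length, content FROM july WHERE uri = ?',
        undef, $uri);
    defined $arc or die "not in this crawl: $uri\n";

    open my $fh, '<:raw', "arcos/$arc" or die "cannot open $arc: $!\n";
    seek $fh, $start, 0;                    # jump to the stored content
    read $fh, my $body, $length;

    print "Content-type: $type\r\n\r\n";    # minimal CGI-style response
    print $body;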
For HTML, the script makes some modifications and additions to the page. Here is an HTML page from the archive. Notice the green "DI2 Archive" tab in the upper right corner.
The ARC and ARCO files hold the content exactly as it was found; none of this source information is tampered with in any way. These archive files are also rather large. The 120,000+ documents we gathered from I2 resulted in 79 ARC files, most over 120MB apiece. For comparison, our scanned poster images often take over 300MB apiece to store. Managing the upkeep of 79 large files is much simpler than managing the upkeep of 120,000 small files. We expect to use the techniques we've established for the preservation of image files to manage the storage of these ARC and ARCO files.
Of course, lurking below the surface (or inside the ARCs) is the continuing problem of how to maintain access to the file formats that were caught in the crawl. Our July crawl summary reveals thousands of PowerPoint, PNG, XML, and PDF documents; hundreds of MS Word, RealAudio, and ZIP files; dozens of movie files in various formats; and about two dozen file formats in all. We have no magic solution to this thorny issue; we simply note here that we are doing "bit preservation" without any real idea how to maintain access to these formats over time.
Along the way we have also noted some limitations of this approach:
- The crawl does not provide any notion of when a file was created or last modified; all we know is when it was crawled. This is a notable shortcoming for future researchers.
- The crawl requires significant partnership between the crawling agency (us) and the source of the data (the site). This is especially critical if the crawl is to disobey the usual robot exclusion directives on a site.
We'd still like to do...
- A crawl of I2 that overcomes some of the robot directives.
- Adjustments to our scripts and practice so that multiple crawls of the same site can be searched at once.