Paperless Publishing with a Twist: It May Work
Sun Microsystems Laboratories and Stanford University are testing a new, real-world approach to preserving access to scientific, technical, and medical journals. It promises to help libraries better exploit Web technologies. In the bargain, it may make good on another promise: using information technology to reduce the flow of paper worldwide.

Using Java[tm] and Linux technologies, Stanford Library and Sun Microsystems Laboratories researchers have adapted a centuries-old model for circulating paper to create one that reduces our reliance on it.

It happens millions of times a day all over the wired world: Search the web. Find a URL. Go there. Bookmark it ... and then print hardcopy. A commonplace event at work and home, printing paper is a habit we can't seem to kick. Its persistence has surprised and stumped analysts who once boldly imagined the Paperless Office. In fact, office and home printing is driving a boom in demand for paper, one that is predicted to double office paper consumption by the year 2003.

Two recent workplace surveys confirm the trend and suggest an unlikely culprit: the Web itself. The surveys found that information workers are increasing their Internet printing volume at home and at the office. The bottom line: we print hardcopy because we are concerned that the information may not be on the Internet the next time that we need it. Our concern is well founded.

The impermanence of Web content is a reliable fact of digital life that affects everyone. It is also the focus of a unique collaboration between Stanford University Library's Vicky Reich and Sun Microsystems Laboratories' David Rosenthal. Their goal is to provide libraries with reliable, persistent access to on-line journals. Using Java technology, Reich and Rosenthal have adapted a centuries-old model for circulating paper to one that reduces our reliance on it.
Lots of Copies Keeps Stuff Safe

This month, with a grant from the National Science Foundation and support from Sun Microsystems, they began alpha testing their solution at six libraries from Harvard and Columbia to Stanford and Berkeley. The content is Science, a magazine published by the American Association for the Advancement of Science (AAAS). Reich and Rosenthal call their new system LOCKSS, for Lots of Copies Keeps Stuff Safe, a name that belies the inventiveness of their approach.
LOCKSS is an open source, Java[tm] and Linux-based distributed system. It is designed to operate on slow, inexpensive hardware without central administration. Running autonomously and deploying a clever system of polling, LOCKSS permanently caches copies of on-line content -- enough copies to assure access around the world in case publishers fold or no longer support user access. So when an issue of an on-line journal is misplaced or damaged, LOCKSS takes notice and replaces it.

Reich and Rosenthal's immediate target is libraries, where the advances of on-line publishing have been held back by libraries' reluctance to subscribe exclusively to on-line editions of journals.

For Reich, assistant director of Stanford Library's HighWire Press, the attraction is irresistible. HighWire Press publishes the on-line editions of approximately 210 scientific, technical, and medical (STM) journals, publishing a new page every few seconds, 24x7. Reich and her colleagues have championed the move away from paper, developing user-friendly techniques that STM audiences take for granted. These, combined with hyperlinks to related articles, bibliographies, and footnotes, and with improved searchability, make the Web versions easier and faster to access and more useful than paper editions. Many on-line STM journals now publish earlier and contain more information than their paper editions.

On the one hand, libraries such as Stanford's are eager to provide on-line access to STM journals because the Web is a far more effective medium than paper. On the other hand, what happens when the on-line publisher fails, or arbitrarily decides to deny access to its archives? "Preservation is totally at the whim of the publisher," notes Rosenthal. "The publisher may promise 'perpetual access,' but there is no business model to support the promise." "Paper does have one essential property the Web lacks, permanence," observes Reich.
LOCKSS figures to change that with an approach that combines the advantages of on-line publishing with the centuries-old craft of library management.

Affordable Web Cache

"Librarians' technique for preserving access to material published on paper has been honed over the years since 415 AD, when much of the world's literature was lost in the destruction of the Library of Alexandria," Reich and Rosenthal observe in a paper to be presented at Usenix in June. The "fundamental requirement" for LOCKSS was to model the best library techniques as closely as possible for material published on the Web. A comparable system might have saved much of the world's literature lost in the fire that destroyed the Library of Alexandria in 415 AD.

Those techniques are based on simple rules. Acquire lots of copies. Scatter them around the world so that it is easy to find some of them and hard to find all of them. Lend or copy your copies when other libraries need them. And collaborate only with competent and trusted libraries. These are the design principles that LOCKSS implements, with a further proviso that it runs on cheap, slow, old computers "stolen from the junk heap," says Reich.

Unlike archival systems that preserve copies at any cost, LOCKSS preserves access for circulating journals with a frugality that will make it affordable to perennially cash-strapped libraries. For the alpha test now underway, "we're using really old, beat-up 75 and 100 MHz Pentiums," says Rosenthal. A sophisticated polling technique and a unique security system complement the modest hardware requirements.

Each participating library behaves as a Web cache. The process begins when a librarian supplies an instance of LOCKSS with a publisher's URL and publishing frequency. The publisher uses the library's IP address for authentication, and LOCKSS then launches a web crawler that traverses the publisher's sub-trees, fetching a copy of the journal page by page.
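The crawl step described above can be pictured with a short sketch. The Python below is not LOCKSS code (LOCKSS itself is Java-based); it is a minimal illustration, assuming a hypothetical `fetch` function that returns a page body and its links, of a crawler that starts at a publisher's URL and follows only links that stay inside that URL's sub-tree. The site contents, the `journal.example` host, and all function names are invented for the example.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl_subtree(start_url, fetch):
    """Breadth-first crawl that never leaves start_url's sub-tree.

    `fetch` is a hypothetical callable: url -> (body, list_of_links).
    Returns a dict mapping each cached URL to its body.
    """
    root = urlparse(start_url)
    cache = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in cache:
            continue  # already cached this page
        body, links = fetch(url)
        cache[url] = body
        for link in links:
            target = urljoin(url, link)  # resolve relative links
            parsed = urlparse(target)
            # Follow only links on the same host and under the starting path.
            if parsed.netloc == root.netloc and parsed.path.startswith(root.path):
                queue.append(target)
    return cache


# A tiny in-memory "publisher site" standing in for real HTTP requests.
SITE = {
    "http://journal.example/vol1/": (
        "index", ["issue1.html", "issue2.html", "/about.html"]),
    "http://journal.example/vol1/issue1.html": (
        "issue 1", ["issue2.html"]),
    "http://journal.example/vol1/issue2.html": (
        "issue 2", ["http://other.example/ad.html"]),
}


def fake_fetch(url):
    return SITE[url]


cache = crawl_subtree("http://journal.example/vol1/", fake_fetch)
for url in sorted(cache):
    print(url)
```

In this toy run the crawler caches the volume index and both issues, and skips the link to `/about.html` and the link to another host, because neither lies under the starting URL.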
The library caches communicate with each other "in wall-clock time," using the Library Cache Auditing Protocol (LCAP), which Reich and Rosenthal created. LCAP is a reliable, scalable IP multicast protocol that continuously polls member libraries to check for missing or damaged copies. LCAP takes advantage of multi-threaded Java code, so a variety of processes can run in the background. Among them are random polls in which the caches run "diffs" on their respective copies, comparing content by walking through directories by journal, volume, and issue. Damaged or missing copies trigger a replication process modeled on the inter-library loan system, but only after LOCKSS conducts an "opinion poll" on the competency of the library with the problem cache.

Perhaps the most intriguing part of the security system is that it is designed to leverage the unique characteristics of the LOCKSS system. Because LOCKSS is not centrally administered but rather distributed, there is no single point of failure. Because LOCKSS runs very slowly, an attacker "must persist in taking bad actions over a long period of time," according to Reich and Rosenthal. "By operating slowly even on human timescales, the system makes it easier to detect an attacker and limits the damage he can do before being stopped."

The LOCKSS security system is forgiving, too, which is remarkable for an autonomous caching system. For this it relies on maintaining a record of public behavior. Since each cache maintains a registry of every other cache's polling behavior, mistrusted caches are eventually excluded from polling, copying, and lending operations. If the mistrusted cache changes its ways, demonstrating its reliability in a sufficient number of polls over time, it is readmitted to the peer group and then granted voting and lending privileges in the LOCKSS system.

Reich and Rosenthal are quick to point out that theirs is not "a general-purpose Web content preservation system."
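The polling, repair, and reputation ideas described above can be made concrete with a toy model. The Python sketch below is not LCAP: it leaves out multicast, randomized timing, and the deliberately slow pace, and simply assumes that each cache reports a SHA-256 digest of its copy, that the digest held by most trusted caches wins, and that a per-cache score stands in for the registry of polling behavior. The cache names, the `trust_floor` threshold, and the scoring rule are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Fingerprint of one cached copy (SHA-256 is an assumption here)."""
    return hashlib.sha256(content).hexdigest()


def run_poll(caches, key, reputation, trust_floor=-2):
    """Run one simplified poll on the item stored under `key`.

    caches:     {cache_name: {item_key: bytes}}
    reputation: {cache_name: int}, updated in place
    Only caches whose reputation is above `trust_floor` get a vote, but
    every cache is scored, so a mistrusted cache can earn its way back.
    Returns (winning_digest, list_of_repaired_cache_names).
    """
    reports = {name: digest(store[key])
               for name, store in caches.items() if key in store}
    trusted = {name: d for name, d in reports.items()
               if reputation[name] > trust_floor}
    if not trusted:
        return None, []
    winner = Counter(trusted.values()).most_common(1)[0][0]
    # Any trusted cache holding the winning copy can "lend" it.
    source = next(name for name, d in trusted.items() if d == winner)
    repaired = []
    for name in caches:
        if reports.get(name) == winner:
            reputation[name] += 1
        else:
            # Damaged or missing copy: lower the score, then repair it.
            reputation[name] -= 1
            caches[name][key] = caches[source][key]
            repaired.append(name)
    return winner, repaired


# Made-up data: four caches, one damaged copy and one missing copy.
ITEM = "journal/vol1/issue1"
caches = {
    "library_a": {ITEM: b"original article"},
    "library_b": {ITEM: b"original article"},
    "library_c": {ITEM: b"origina1 artic1e"},  # damaged
    "library_d": {},                           # missing
}
reputation = {name: 0 for name in caches}

winner, repaired = run_poll(caches, ITEM, reputation)
print("repaired:", repaired)
print("reputation:", reputation)
```

After one poll in this toy setup, the damaged and missing copies are replaced from a cache that agreed with the majority, and the two repaired caches each lose a point of reputation. A cache that keeps disagreeing would eventually drop below `trust_floor` and lose its vote, while still being scored in later polls.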
LOCKSS is designed only for journals published by Stanford's HighWire Press. To be sure, LOCKSS's slow, methodical polling and copying system is "clearly not suitable for volatile content" such as that of a CNN news site. But Reich and Rosenthal do allow that "[i]t may be possible to apply the system to other types of content." Its affordability and ease of use are promising. Adds Reich, "It certainly will reduce the need for paper." And that's a promise that LOCKSS and similarly designed systems may keep.