Feature Story

LOCKSS: Protecting and Preserving Web Documents


The transitory nature of Web content is a fact of digital life that affects everyone. How can you find documents posted by publishers who are now defunct? How can you protect archived publications from hazards such as fires, floods, or human error? How can you ensure that your published materials will always be found by interested (and authorized) readers? Ensuring continuous access to online scientific journals and other Web documents is the focus of a unique collaboration between Sun Microsystems Laboratories' David Rosenthal and Stanford University Libraries' Vicky Reich. The result is the LOCKSS (Lots of Copies Keep Stuff Safe) system, an exciting new data integrity and document protection solution.


The Key to the LOCKSS System

The goal of the LOCKSS project is to enable libraries to take custody of the material to which they subscribe--in the same way they do for paper--and preserve it permanently. Using a clever polling system, the LOCKSS system permanently caches copies of online content--enough copies to assure continuous access around the world. This helps ensure that links and searches by authorized individuals continue to locate the published material even if it is no longer available from the publisher. And when a copy of an online journal is misplaced or damaged, the LOCKSS system takes notice and replaces it.

A comparable system might have saved much of the world's literature lost in the fire that destroyed the Library of Alexandria in 415 AD.

The concept behind the LOCKSS system is based on simple rules. Acquire lots of copies. Scatter them around the world so that it is easy to find some of them and hard to find all of them. Lend or copy your copies when other libraries need them. And collaborate only with competent and trusted libraries. Reich adds a further proviso: the system should run on cheap, slow, old computers "stolen from the junk heap."

Unique Integrity System

A sophisticated polling technique and a unique integrity system complement the modest hardware requirements. Each participating library runs a LOCKSS daemon, implemented in Java, that behaves as a Web cache. The process begins when a librarian supplies the daemon with a publisher's URL and publishing frequency. The publisher uses the daemon's IP address for authentication, and the LOCKSS daemon then launches a Web crawler that traverses the publisher's sub-trees, fetching a copy of the journal page by page.
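The crawl described above can be pictured with a minimal sketch. This is not the LOCKSS crawler: the function name crawl_subtree, the fetch_page callable, and the example URLs are all hypothetical, and an in-memory dictionary stands in for real HTTP requests. The only idea it illustrates is a crawl that stays inside the publisher's sub-tree and caches each page once.

```python
from collections import deque


def crawl_subtree(base_url, fetch_page):
    """Breadth-first crawl that never leaves the sub-tree rooted at base_url.

    fetch_page(url) is a hypothetical callable returning (content, links).
    """
    cache = {}
    queue = deque([base_url])
    while queue:
        url = queue.popleft()
        if url in cache or not url.startswith(base_url):
            continue  # already cached, or outside the publisher's sub-tree
        content, links = fetch_page(url)
        cache[url] = content
        queue.extend(links)
    return cache


# Toy in-memory "publisher site" (an assumption) standing in for HTTP fetches.
SITE = {
    "http://journal.example/vol1/": (
        "index",
        ["http://journal.example/vol1/a1", "http://other.example/ad"],
    ),
    "http://journal.example/vol1/a1": (
        "article 1",
        ["http://journal.example/vol1/"],
    ),
}

cached = crawl_subtree("http://journal.example/vol1/", lambda url: SITE[url])
print(sorted(cached))
```

The link to other.example is never fetched, because it falls outside the sub-tree the librarian supplied.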

[Figure: LOCKSS: How the Data Flows]

The library caches communicate with each other continually but very slowly, using the Library Cache Auditing Protocol (LCAP), which Reich and Rosenthal created. LCAP uses unicast and multicast IP datagrams to enable the LOCKSS daemons to challenge each other to vote in polls proving that their respective copies of journal volumes, issues, and articles are the same. If a daemon loses a poll, it fetches a new copy of the damaged content from the publisher or from one of the winning daemons. This mechanism is analogous to inter-library loans.
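A poll of this kind can be sketched in a few lines. The sketch below is a deliberate simplification and not the LCAP protocol itself: it assumes each daemon simply hashes its copy with SHA-256 and that the most common digest wins, and the names run_poll, repair, and lib_a through lib_c are hypothetical. It shows only the core idea that a daemon whose copy disagrees with the majority replaces it with a winner's copy.

```python
import hashlib
from collections import Counter


def run_poll(copies):
    """Return (winning_digest, losers) for one article.

    copies maps a daemon name to the bytes that daemon holds.
    """
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in copies.items()}
    # The most common digest wins; ties go to the digest seen first.
    winning_digest, _ = Counter(digests.values()).most_common(1)[0]
    losers = [name for name, d in digests.items() if d != winning_digest]
    return winning_digest, losers


def repair(copies, losers):
    """Overwrite each loser's copy with the copy held by a winning daemon."""
    winners = [name for name in copies if name not in losers]
    for name in losers:
        copies[name] = copies[winners[0]]


copies = {
    "lib_a": b"Volume 1, Issue 2",
    "lib_b": b"Volume 1, Issue 2",
    "lib_c": b"Volume 1, Issu\x00 2",  # simulated damage
}
digest, losers = run_poll(copies)
repair(copies, losers)
print(losers)
```

In the system described here, the repair could equally come from the publisher; the sketch models only the "winning daemon" path.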

The system's reliability depends not on the LCAP protocol, which is itself unreliable, but on the presence of large numbers of replicas and the voting mechanism. LCAP provides "public" communication--a daemon cannot be certain which other daemons heard a message it sent. This enables daemons to make their own estimates of the credibility of other daemons using a reputation system.

Perhaps the most intriguing part of the integrity system is that it is designed to leverage the unique characteristics of the LOCKSS system. Because the system is distributed rather than centrally administered, there is no single point of failure. And because LOCKSS runs very slowly, an attacker "must persist in taking bad actions over a long period of time," according to Reich and Rosenthal. "By operating slowly even on human time scales, the system makes it easier to detect an attacker and limits the damage he can do before being stopped."

The LOCKSS integrity system is forgiving, too, which is remarkable for an autonomous caching system. For this it relies on maintaining a record of public behavior. Since each cache maintains a registry of every other cache's polling behavior, mistrusted caches are eventually excluded from polling, copying, and lending operations. If the mistrusted cache changes its ways, demonstrating its credibility in a sufficient number of polls over time, it is readmitted to the peer group and then granted voting and lending privileges in the LOCKSS system.
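The exclude-then-readmit behavior can be illustrated with a small registry. This is a sketch under stated assumptions, not the scoring LOCKSS actually uses: the class name ReputationRegistry, the penalty and reward values, and both thresholds are hypothetical. It shows one cache's view of its peers, where a bad poll costs more than a good poll earns, so trust is lost quickly and regained slowly.

```python
class ReputationRegistry:
    """One cache's record of how other caches have behaved in polls."""

    def __init__(self, penalty=3, reward=1, exclude_below=-2, readmit_at=0):
        # All four numbers are illustrative assumptions.
        self.penalty = penalty
        self.reward = reward
        self.exclude_below = exclude_below
        self.readmit_at = readmit_at
        self.scores = {}
        self.excluded = set()

    def record(self, peer, agreed_with_majority):
        """Update a peer's score after a poll and adjust its standing."""
        score = self.scores.get(peer, 0)
        score += self.reward if agreed_with_majority else -self.penalty
        self.scores[peer] = score
        if score < self.exclude_below:
            self.excluded.add(peer)
        elif score >= self.readmit_at:
            self.excluded.discard(peer)

    def is_trusted(self, peer):
        return peer not in self.excluded


registry = ReputationRegistry()
registry.record("lib_c", agreed_with_majority=False)  # score -3: excluded
history = [registry.is_trusted("lib_c")]
for _ in range(3):  # three good polls bring the score back to 0
    registry.record("lib_c", agreed_with_majority=True)
    history.append(registry.is_trusted("lib_c"))
print(history)
```

With these assumed values, a single bad poll excludes the peer, and it takes three consecutive good polls before the peer is readmitted.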

Reich and Rosenthal are quick to point out that theirs is not "a general-purpose Web content preservation system." LOCKSS is designed only for Web journals such as those published by Stanford's HighWire Press. To be sure, LOCKSS' slow, methodical polling and copying system is "clearly not suitable for volatile content" such as that of a CNN news site. But Reich and Rosenthal do allow that "it may be possible to apply the system to other types of content."

Running LOCKSS

The current LOCKSS version runs on generic PCs. At current prices, a suitable machine with a 60GB disk in a 1U rack-mount case should cost about $750. The system is distributed as a bootable floppy disk. The system boots and runs Linux from this floppy; there is no operating system installed on the hard disk. The first time the system boots it asks a few questions, then writes the resulting configuration to the floppy, which is then write-locked. At any time, the system can be returned to a known-good state by rebooting it from this write-locked disk.

Each time the system is booted, it downloads, verifies, and installs the necessary application software, including the daemon that manages the LOCKSS cache and the Java™ virtual machine needed to run it. The system then runs the daemon and starts the HTTP servers that provide the user interface Web pages. The cache's administrator can use these pages to specify the journal volumes to cache and monitor the system's behavior.
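The article does not say how the downloaded software is verified. One common approach, shown here purely as an assumption, is to compare a cryptographic digest of the download against a known-good value. The function name verify_package and the placeholder bytes are hypothetical.

```python
import hashlib
import hmac


def verify_package(data, expected_sha256_hex):
    """Return True only if data hashes to the expected SHA-256 digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest performs a constant-time comparison of the two strings.
    return hmac.compare_digest(actual, expected_sha256_hex)


package = b"daemon package bytes (placeholder)"
known_good = hashlib.sha256(package).hexdigest()  # would be shipped separately

ok = verify_package(package, known_good)
tampered = verify_package(package + b"x", known_good)
print(ok, tampered)
```

A boot sequence built this way would install the package only when the check succeeds and otherwise fall back to the known-good state on the write-locked disk.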

Summary: The Promise of LOCKSS

For scientists, librarians, and publishers who are concerned that the digital material that has become the record of science will prove as evanescent as the rest of the Web, the LOCKSS system is a very promising solution. It has the capability to deliver on a wide spectrum of needs:

  • Providing future generations of scientists with access to all current literature for research, teaching, and learning.
  • Ensuring that current and future librarians have an inexpensive, robust mechanism--which they control--to provide their communities with long-term access to essential literature.
  • Providing current and future publishers with an assurance that their journals' editorial values and brands will be available only to authorized and authenticated readers.

Project Status

The LOCKSS system is in the midst of a major test involving libraries and publishers around the world. As of September 2001, 45 libraries on five continents have signed onto the project (including Harvard University, Library of Congress, New York Public Library, Los Alamos National Laboratory, and the British Library), and 53 publishers are endorsing the LOCKSS beta test. Up-to-date project status is available at:

http://lockss.stanford.edu/projectstatus.htm

Acknowledgements

The Stanford University Libraries LOCKSS team members are:

  • Vicky Reich
  • Tom Robertson (HighWire Press)
  • David Rosenthal (Sun Microsystems)
  • Mark Seiden and Tom Lipkis (Consultants)

The National Science Foundation, Sun Microsystems Laboratories, and Stanford University Libraries funded development and alpha testing of LOCKSS. The worldwide "beta" test in 2001 is made possible through a grant from the Andrew W. Mellon Foundation, equipment donated by and support from Sun Microsystems Laboratories, and support from Stanford University Libraries.

We are grateful to the contributors at our alpha sites:

  • Dale Flecker and Stephen Abrams
  • Rick Luce and Mariella DiGiacomo
  • David Millman and Ariel Glenn
  • Bernie Hurley and Janet Garey
  • Chris Hodges and Hal Clyde King
  • and Jerry Persons

Special thanks are due to:

  • Michael Lesk
  • Michael Keller
  • Bob Sproull
  • Neil Wilhelm
