
Project Celeste: A New Model for Massively Scalable Storage

Sun Labs researchers are exploring ways to transform file storage into a ubiquitous resource that can be delivered quickly and efficiently, at a low cost, to anyone who needs it, in any quantity, at a predictable level of quality.

Read Technical Report TR-2007-160

April 5, 2007 - Project Celeste is investigating a radically new approach to pooling storage resources and creating a "storage utility"—essentially doing for file storage what Edison did for electricity. The exploration is being led by Sun Labs senior researcher Glenn Scott.

A Celeste system is made up of many computers (nodes), each contributing whatever storage and processing capacity it can spare to the network. Every node participates in the system using the same protocols as the other nodes, and users access the services and resources of the system with these same Celeste protocols.

It is possible to build massive Celeste systems, with an extremely large number of participating nodes, using a wide range of devices: large, high-speed servers, small laptops, dedicated storage appliances, even ad-hoc collections of spare computers. These computers need not be located in the same place or even near each other; they can be distributed on an intranet or across the Internet.

When users store a file on a Celeste system, they do not need to know or care which particular nodes contain the file, because in fact many of them do. The file is stored not as an integral whole but in "chunks" that are distributed among multiple nodes. When a user wants to retrieve the file, the Celeste system reconstructs it, pulling together the needed chunks from nodes that are available and trustworthy at that particular time.
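The chunking idea described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Celeste's actual code or protocol: the class name `ChunkedStore`, the hash-based placement, the fixed replica count, and the in-memory "nodes" are all hypothetical choices made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: splits a file into chunks and replicates each chunk on several nodes. */
final class ChunkedStore {
    private final int chunkSize;
    private final int replicas;
    // Each "node" is modeled as an in-memory map from chunk key to chunk bytes.
    private final List<Map<String, byte[]>> nodes = new ArrayList<>();
    private final Set<Integer> offline = new HashSet<>();
    private final Map<String, Integer> chunkCounts = new HashMap<>();

    ChunkedStore(int nodeCount, int chunkSize, int replicas) {
        if (nodeCount < 1 || chunkSize < 1 || replicas < 1 || replicas > nodeCount) {
            throw new IllegalArgumentException("invalid store parameters");
        }
        this.chunkSize = chunkSize;
        this.replicas = replicas;
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
    }

    /** Simulates a node failing or coming back. */
    void setOffline(int node, boolean isOffline) {
        if (isOffline) {
            offline.add(node);
        } else {
            offline.remove(node);
        }
    }

    // Assumed placement rule: hash the chunk key, then use the next nodes in order.
    private int replicaNode(String key, int r) {
        return (Math.floorMod(key.hashCode(), nodes.size()) + r) % nodes.size();
    }

    /** Splits the data into fixed-size chunks and writes every chunk to `replicas` nodes. */
    void store(String name, byte[] data) {
        int count = (data.length + chunkSize - 1) / chunkSize;
        for (int i = 0; i < count; i++) {
            int from = i * chunkSize;
            int to = Math.min(from + chunkSize, data.length);
            String key = name + "#" + i;
            byte[] chunk = Arrays.copyOfRange(data, from, to);
            for (int r = 0; r < replicas; r++) {
                nodes.get(replicaNode(key, r)).put(key, chunk);
            }
        }
        chunkCounts.put(name, count);
    }

    /** Reassembles the file, taking each chunk from the first replica node that is online. */
    byte[] retrieve(String name) {
        Integer count = chunkCounts.get(name);
        if (count == null) {
            throw new IllegalArgumentException("unknown file: " + name);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < count; i++) {
            String key = name + "#" + i;
            byte[] chunk = null;
            for (int r = 0; r < replicas && chunk == null; r++) {
                int node = replicaNode(key, r);
                if (!offline.contains(node)) {
                    chunk = nodes.get(node).get(key);
                }
            }
            if (chunk == null) {
                throw new IllegalStateException("no available replica for " + key);
            }
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }
}
```

With two replicas per chunk, a file stored this way can still be read back after any single node is marked offline, which is the property the paragraph above describes. A real system would also have to verify that the nodes it reads from are trustworthy, which this sketch does not attempt.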

A Reliable System Built from Inherently Unreliable Parts

Celeste systems are designed to be remarkably resilient and reliable. Following the fundamental design principle of the Internet, Celeste systems are built to avoid dependence on any individual node and assume that at any given time multiple nodes will be unavailable for a variety of possible reasons: hardware failures, software errors, power outages, even malicious acts such as tampering or attacks. A Celeste system could continue to deliver its services reliably despite the loss of individual nodes.

In effect, a Celeste system is a type of organism—growing dynamically and organically, continuing to function despite constant change and occasional adversity. Celeste is even prepared for possible disaster scenarios in which the system is fragmented into multiple partitions. The Celeste system keeps operating (albeit with lesser functionality) until the disaster is overcome, then reunites the partitioned fragments.

Equally important, Celeste systems are designed to protect the integrity of the user's files and data. "Celeste systems trust no one and always assume the worst," said Mr. Scott. "They take countermeasures against system attacks and a wide range of exploits and potential cheating, and they monitor the reputation of each individual node."

Specifically, Celeste systems employ a distributed hash table-based routing system with "reputation management," a key innovation that enables the system to verify the identity of individual nodes and track their behavior over time. Each node knows only about its neighbor nodes and can verify those neighbors; no overarching central command exists. Information about node behavior passes from node to node, and nodes that do not consistently pass verification tests are used less by their neighbors. Over time, poorly performing or misbehaving nodes find themselves ostracized: once a node becomes suspect, the system as a whole uses it less frequently or avoids it altogether.
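One way to picture the local side of reputation management is a per-node table of neighbor scores. The sketch below is an assumption-laden illustration, not the Celeste algorithm: the class `NeighborReputation`, the moving-average update, the weight of 0.25, and the distrust threshold are all invented for the example, and the sharing of reputation between nodes is left out entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: one node's view of how much it trusts each neighbor. */
final class NeighborReputation {
    // Assumed weight given to the most recent verification result.
    private static final double WEIGHT = 0.25;

    private final Map<String, Double> scores = new HashMap<>();
    private final double distrustThreshold;

    NeighborReputation(double distrustThreshold) {
        this.distrustThreshold = distrustThreshold;
    }

    /** New neighbors start fully trusted in this sketch (score 1.0). */
    void addNeighbor(String id) {
        scores.putIfAbsent(id, 1.0);
    }

    /** Folds a pass/fail verification result into the neighbor's running score. */
    void recordVerification(String id, boolean passed) {
        double old = scores.getOrDefault(id, 1.0);
        double observed = passed ? 1.0 : 0.0;
        scores.put(id, (1.0 - WEIGHT) * old + WEIGHT * observed);
    }

    double score(String id) {
        return scores.getOrDefault(id, 0.0);
    }

    /** Returns the highest-scoring neighbor at or above the threshold, if there is one. */
    Optional<String> pickNeighbor() {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() >= distrustThreshold && e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }
}
```

A neighbor that repeatedly fails verification sees its score decay toward zero and eventually drops below the threshold, at which point this node stops selecting it. A neighbor that starts passing again recovers gradually rather than instantly, which is one plausible design choice for discouraging intermittent cheating.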

Adjustable Availability and Service Levels

In the same way that computer system availability levels can be adjusted to meet the service-level requirements of end users, the service level delivered by a Celeste system could be tuned to meet requirements. When a Celeste system is built out, the failure rate of components needs to be carefully considered. Systems must be designed so that the capacity to handle projected requests is available at any given time. The Celeste model is probabilistic: the greater the number of nodes, the higher the probability that a given file will be available when needed.
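The probabilistic claim can be made concrete with a small calculation. The model below is a simplifying assumption for illustration, not a description of how Celeste actually places or encodes data: it supposes that each chunk is copied in full to a fixed number of nodes, that each node is reachable independently with the same probability, and that a file is available only when every one of its chunks is. The class and method names are hypothetical.

```java
/** Hypothetical availability model: full replication, independent node failures. */
final class Availability {
    private Availability() {}

    /** Probability that at least one of `replicas` nodes holding a chunk is reachable. */
    static double chunkAvailability(double nodeUp, int replicas) {
        return 1.0 - Math.pow(1.0 - nodeUp, replicas);
    }

    /** Probability that all `chunks` chunks of a file are reachable at the same time. */
    static double fileAvailability(double nodeUp, int replicas, int chunks) {
        return Math.pow(chunkAvailability(nodeUp, replicas), chunks);
    }
}
```

For example, if each node is reachable 90% of the time, a chunk held by three nodes is unavailable only when all three are down, which happens with probability 0.1 × 0.1 × 0.1 = 0.001, so the chunk is available about 99.9% of the time. Adding nodes that hold a copy drives this figure closer to 1, while splitting a file into more chunks lowers whole-file availability unless the replica count rises as well.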

"We're applying the lessons we've learned over the past few years about peer-to-peer networking and shared peer-to-peer storage environments," said Mr. Scott. "The goal is to create extremely large storage networks that suffer no downtime, no data loss or corruption, and no reliability issues."

And how large is extremely large? According to Mr. Scott, Celeste systems could scale to more than 10^24 bytes of read/write storage. For the sake of comparison, the storage systems in Fortune 500 companies today typically contain a few hundred terabytes of read/write storage, where one terabyte is 10^12 bytes.

A System that Reads and Writes… and Deletes

Another key Celeste innovation is the system's ability to both read archival data and write to stored files. Celeste is a "mutable file store," which means it goes beyond the "write-once, read many" capabilities of traditional object-store technologies such as Honeycomb or Centera. In fact, Celeste users can write to a file multiple times—and Celeste keeps track of each and every modification and version number. This audit capability is particularly useful for regulatory compliance because it establishes exactly which authorized individual made which change at what time.
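The difference between a write-once store and a mutable store with an audit trail can be shown with a minimal version history. This is a sketch under assumptions, not Celeste's data model: the class `VersionedFile`, its fields, and the choice to keep every version's full contents in memory are hypothetical.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: a file that keeps every written version with its author and time. */
final class VersionedFile {
    /** One immutable entry in the file's history. */
    static final class Version {
        final int number;
        final String author;
        final Instant writtenAt;
        private final byte[] content;

        Version(int number, String author, Instant writtenAt, byte[] content) {
            this.number = number;
            this.author = author;
            this.writtenAt = writtenAt;
            this.content = content.clone();
        }

        byte[] content() {
            return content.clone();
        }
    }

    private final List<Version> history = new ArrayList<>();

    /** Appends a new version instead of overwriting; returns its version number (from 1). */
    int write(String author, byte[] content, Instant when) {
        int number = history.size() + 1;
        history.add(new Version(number, author, when, content));
        return number;
    }

    /** Reads the most recent version. */
    byte[] read() {
        if (history.isEmpty()) {
            throw new IllegalStateException("file has no versions");
        }
        return history.get(history.size() - 1).content();
    }

    /** Reads a specific earlier version by number. */
    byte[] read(int versionNumber) {
        if (versionNumber < 1 || versionNumber > history.size()) {
            throw new IllegalArgumentException("no such version: " + versionNumber);
        }
        return history.get(versionNumber - 1).content();
    }

    /** Read-only view of who wrote which version and when. */
    List<Version> auditTrail() {
        return Collections.unmodifiableList(history);
    }
}
```

Because writes only ever append, the audit trail answers the compliance question raised above (who changed what, and when) without any extra bookkeeping. The sketch does not address authorization or true deletion of old versions.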

Celeste also enables users to delete files, which is no small feat from a technological perspective. With most file-store technology today, an actual deletion is rare. What really happens when the user hits the "delete" button is that the file is unlinked from the file system, meaning the space it occupies is available for future writing, which may or may not ever happen.

In the vast majority of consumer applications, this style of deletion is perfectly adequate—no one knows or cares what happened to the file; out of sight, out of mind. But in some cases, a true delete is required—meaning the data must be absolutely, positively gone. For example:

  • If there is a business need for absolute, irretrievable deletion, Celeste can do that.

  • Encrypted files are often deleted by simply throwing away the encryption key—but it may still be possible for someone to derive the key and reconstruct the file

  • In sensitive applications the requirement could be even stronger—to the point where there can be no trace of the file and no evidence that it ever existed

Celeste addresses each of these deletion requirements, making it possible to do complete, worry-free deletes of any archived file.

Celeste also provides a unique capability called "authorization after the fact," in which an authorized user can see the changes made to a file by another user and subsequently authorize the changes that were made. With other systems, people who are not explicitly authorized to write to a file are simply denied access.

Commercial Possibilities for Celeste

Celeste has tremendous potential for traditional storage applications and could also serve as a catalyst for a wide range of new service offerings. For example:

  • For large corporations with excess capacity in server and storage resources, Celeste can provide a way to increase utilization rates and avoid or delay capital expenditures for new systems

  • Service providers could connect to a centralized Celeste system and pay for the resources they need to serve their customers, or create their own Celeste systems and provide new value-added, subscription-based storage services

  • Major content owners, such as CNN, Time Warner, and Disney, could use Celeste systems to solve their storage capacity problems—and improve the performance experienced by end users at the same time, because Celeste could allow movies and videos to be cached locally and play right away

  • Web sites that are geographically distributed could use Celeste systems to serve Web pages; pages that are in high demand could be cached instantly while pages that are requested less could be cached less often

Sun is currently investigating these and other commercial opportunities and intends to involve the development and service provider communities in emerging opportunities.

Under the Hood

Celeste is implemented in the Java programming language and runs in a J2SE environment. The Project Celeste team has built a working prototype, which has been tested on the Solaris, Linux, Mac OS X, and Windows operating systems. A single prototype Celeste system can run concurrently on a mixture of all these operating systems.

Project Celeste also owes a significant debt of gratitude to Berkeley OceanStore, a global persistent data store designed to scale to billions of users.

Mr. Scott and his team have been working on Project Celeste for three years now, and a number of Sun customers have taken active interest and are evaluating the possibilities for their business models.

"What we're working on now is refining the prototype so that it's simpler to use and administer; then we'll continue exploring the possibilities of the technology," said Mr. Scott. "I have to consciously avoid the temptation to create a specific product; that's someone else's job. We're here to develop this very promising technology, and we fully expect a successful transfer of that technology to a Sun product group in the near future."

For More Information

Details about the underlying technologies for Celeste can be found at the following sites:

  • Project Celeste

  • "Deleting Files in the Celeste Peer-to-Peer Storage System," in the 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06), pages 29-38, October 2006.

  • Technical Report: "Maintaining Object Ordering in a Shared P2P Storage Environment"

  • An overview of OceanStore