Free Your Mind
How a Small, Persistent Team Created a Revolutionary File System
By Al Riske
28.Sept.06--Like all good engineering efforts, ZFS, the revolutionary new file system in Solaris 10, began in a moment of anger. It was 1996. The Solaris team had just moved into new digs in Menlo Park, California, and the server was down. An upgrade from one generation of disks to the next. "None of us had access to any of our home directories. Our mail, our calendars -- all gone," recalls Jeff Bonwick, now a Distinguished Engineer and CTO of Storage for Sun.
For a few hours, no one could do much of anything but talk. "I was over in Tim Marsland's office and we were griping about what a ridiculous pain this was and asked, 'Why is it like this anyway?'" Bonwick says. "If I want to add more memory to a machine, it's no big deal. I just pop off the cover, stick in some more DIMMs [dual in-line memory modules], power it up, and everything just gets better." Bill Moore, who would eventually join Bonwick as co-lead on ZFS, interjects a simple example: "Firefox needs more memory? It gets more memory. No need to run dimmconfig or create virtual DIMMs. The operating system manages memory as a resource." "So we were looking at this and saying, 'A disk is so much like a DIMM really -- the disk is persistent, the DIMM is not, but other than that ... Why can't we do the same thing for disks that we've done for memory?'" Bonwick says.
There were a lot of reasons, apparently. While Bonwick and Marsland, a fellow Solaris engineer, were brainstorming, another colleague stopped by and said, "Well, you can't just do that because the system administrator has to know this disk is paired with that disk and ..." "He had been around UNIX forever and had a certain view of the world based on what his experiences had been, so we felt a little bit humbled," Bonwick says, "but we were still thinking, 'Naw, there's got to be a way to pull this off.'" Four years would go by, however, before he got to take a shot at it. "In 2000, I was in between projects and trying to decide, 'What do I want to do next?' Well, we'd had three different groups that nominally had the charter for creating a new file system. None of them had quite managed to pull it together, even though there were some good ideas that had been floated. So I decided, however cursed a project this seems to be, I'm going to give it a shot," Bonwick says. "Had you known ..." Moore chuckles. "If I had any idea how hard it was going to be, I wouldn't have done it," Bonwick says. "Ignorance is very powerful."
He asked for a small team -- five or six people. His manager gave him 80. "The thing is, you can't show up out of the blue with some wacky idea and expect 80 people to just sign up -- like, 'Yeah, that's righteous. That's the way to go,'" Bonwick says.
For a year, he tried to make it work, fighting friction all the way, and finally decided he had to start over. "The only reason I didn't give up entirely -- I was so disgusted with the politics -- was because I knew we had hired Matt Ahrens to work on this about a year earlier. It would be kind of crappy to say, 'Hey, you know that thing we hired you to work on? I gave up.' So, I decided, I have to give this one more shot," Bonwick says. "Once it was just the two of us, Matt and I were having a great time at the white board five hours a day, going through all kinds of ideas. We started coding almost immediately. That was in July of 2001. By Halloween, we had the first basic version of it working in userland. A year later, we had it working in the kernel."
"I showed up about six months after kernel mount. June of 2003," Moore says. Moore had first joined Sun in 1996, working on the server side, but left in 1999 to form 3PARdata, an enterprise storage startup, where he was involved in every aspect of the company, including overall hardware and software design. The experience would prove valuable when he rejoined Sun in 2003.
"I was hired to work on x86 performance, but every time I'd see Jeff he'd say, 'Hey, check out this file system and look at what we're doing here,'" Moore recalls. "He was all chummy. 'Here, unplug my machine. No, really, unplug it.' So I did, and it showed that all the files were still there." Moore quickly got the impression that Bonwick was trying to recruit him to work on ZFS. "I installed ZFS on a machine, started trying it out, and as soon as I start writing some files to it, I hear the disks go eeee, errr, bzzz. Having worked for a storage startup, I knew the sound of unhappy disks when I heard it," Moore says. He thought he would have to write a kernel instrumentation framework to collect data on what was happening. "I was getting ready to do that when Bryan Cantrill shows up and says, 'There's this DTrace thing I'm working on. Why don't you use that?' So I get some build out of his home directory and install it on my desktop, because DTrace hadn't integrated at that point, and I tried it out," Moore recalls. Within five minutes, he had the I/O data he needed. "So I plot logical block address versus time and see there is a solid band of low logical block addresses and a solid band of high ones. Essentially what this means is the drive was just seeking back and forth between these two bands as fast as it could. I know how to fix that," Moore says. "So I asked Jeff, 'Are you guys going to do anything around I/O scheduling?' He said, 'Naw, disks are getting smarter. We'll just send all the I/Os down to the disks and let them handle it.' I said, 'Yeah, that's not really how disks work.' And that's how I started working on ZFS. I did an I/O scheduler."
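The seek pattern Moore describes can be illustrated with a toy model. The sketch below is not the ZFS I/O scheduler; it assumes a hypothetical queue of logical block addresses that alternates between a low band and a high band, and compares total head travel when requests are serviced in arrival order against an elevator-style ordering. The function names and the workload are invented for illustration.

```python
def total_seek_distance(lbas, start=0):
    """Sum of head movement, in blocks, when servicing requests in the given order."""
    distance = 0
    position = start
    for lba in lbas:
        distance += abs(lba - position)
        position = lba
    return distance


def elevator_order(lbas, start=0):
    """One upward sweep from the current head position, then one downward sweep."""
    ahead = sorted(lba for lba in lbas if lba >= start)
    behind = sorted((lba for lba in lbas if lba < start), reverse=True)
    return ahead + behind


# Hypothetical workload: requests alternate between a low band and a high band.
pending = [10, 900_000, 20, 900_010, 30, 900_020]

fifo_cost = total_seek_distance(pending)                     # arrival order
sorted_cost = total_seek_distance(elevator_order(pending))   # reordered
print(fifo_cost, sorted_cost)
```

In this toy case the reordered queue crosses the gap between the two bands once instead of five times, which is the kind of saving a scheduler can find that a first-come, first-served queue cannot.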
Simply put, ZFS is a new kind of file system that provides simple administration, transactional semantics, and end-to-end data integrity. Today it's recognized as one of the features that makes Solaris 10 the most advanced operating system on the planet, but all along the way, there were obstacles -- both political and technical. The chief political concern among engineers is that there's no such thing as buy-in, only rent-in. It's often the case, for example, that a project will have a longer lifespan, so to speak, than the vice president managing it. So, while the first VP rotates into a new assignment, the new person may come in with a different set of priorities. "Meanwhile you've got this thing that's consuming five or six of the very best people and doesn't seem like it's going to end anytime soon. Because it does take awhile," Bonwick says. Technical concerns, as noted earlier, were raised from day one. "There was a lot of just plain disbelief in the beginning that it could be done at all, because we were putting so many things on the table," Bonwick says. "Some of the concerns ended up being legit, but I think the main thing I bring to the table when I take on a new problem is that I'm not interested in how it's been done before. I only want to know, of all the constraints people tend to assume, which ones are actually fundamental and which ones are just habit and can be thrown away?"
"Right," Moore says. "So one way to say it might be that what you bring to the table is complete and utter ignorance of the problem you're about to work on." "Ignorance and a willingness to leverage it," Bonwick replies, clearly accustomed to Moore's teasing. "There is a downside to having a bunch of expertise in something, because you do start to see the problem a certain way. That's what it means to be expert in it. You have that perspective. So, yeah, I'm a big fan of bringing in novices to deal with things that are broken at the deepest conceptual levels." "Free your mind," Moore says. "That's from The Matrix, right? It's one of the things we had to keep doing ... The people who wrote the technology that preceded ZFS were not all dumb. They did everything for a very particular and, at the time, very good reason. But the world is different now than it was 15 years ago, so the question was: Are those reasons still valid?" "In the end, the job of a file system is to read and write blocks, such that what you read is identical to what you wrote previously. That's the fundamental guarantee for system storage, but there are a number of things that can cause that not to work out," Bonwick says. "You can have media errors on disk where some of the bits just rot, or one of the disks can go bad on you. You can get bit flips on the way from the disk into the machine." "Firmware on the disk can give you the wrong block by mistake," Moore interjects. "All kinds of crazy stuff can happen." "That has made providing end-to-end data integrity more important going forward," Bonwick says. "But the Prime Directive of file system design has always been: No bcopy()s. Never touch the data." Too costly in terms of performance, Moore explains. "That was a valid assumption at the time because performance really would be killed by making a pass over the data 15 years ago," Bonwick says. "But CPU is now 10x memory and 100x disk compared to what it was a decade and a half ago. 
That means that now, when you want to make a pass over the data to checksum it, you're talking about some modest percentage of your CPU as opposed to eating the whole thing." So one of the things ZFS does is checksum the data, every time. "It's no longer the onerous burden it once was to make sure all your data is intact, and given the size of disk drives -- 500 gigs today and soon 1 terabyte -- the alternative is really unacceptable. Nobody wants their system down for hours while it runs a file system check," Moore adds. "Furthermore, if you were to have a genuine file system error that went undetected, as it would if you didn't have checksums, then you'd have to recreate all that data. That's a lot of typing." "With ZFS, end-to-end data integrity isn't just for enterprise-class storage systems anymore. If you have a single 80GB laptop drive, and you lose a 1GB region due to some catastrophic event, the other 79GB is still fully accessible. That's because we replicate ZFS metadata on different parts of the disk that are physically far apart," says Bonwick. "That's something no other file system on the planet today can do," adds Moore.
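To make the idea of checksumming every read concrete, here is a minimal sketch. It is not how ZFS lays out data on disk: the ChecksummedStore class, the choice of SHA-256, and the two in-memory replicas are all assumptions made for illustration. It shows only the principle the engineers describe: keep a checksum apart from the data, verify it on every read, and fall back to a second copy when the first one fails.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Digest used to detect corruption (SHA-256 chosen only for this sketch)."""
    return hashlib.sha256(data).digest()


class ChecksummedStore:
    """Toy block store: each block is written twice, checksum kept separately."""

    def __init__(self):
        self.copies = [{}, {}]   # two hypothetical replicas: block id -> bytes
        self.checksums = {}      # block id -> expected checksum

    def write(self, block_id, data: bytes):
        for replica in self.copies:
            replica[block_id] = data
        self.checksums[block_id] = checksum(data)

    def read(self, block_id) -> bytes:
        expected = self.checksums[block_id]
        for replica in self.copies:
            data = replica[block_id]
            if checksum(data) == expected:
                # Rewrite any replica that no longer matches the checksum.
                for other in self.copies:
                    if checksum(other[block_id]) != expected:
                        other[block_id] = data
                return data
        raise IOError(f"block {block_id}: all copies failed checksum")


store = ChecksummedStore()
store.write(7, b"home directory contents")
store.copies[0][7] = b"home directory c0ntents"   # simulate silent corruption
recovered = store.read(7)                          # served from the good copy
```

The read never returns the damaged bytes: the first copy fails verification, the second passes, and the bad copy is overwritten with good data on the way out.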
Historically, the interface between the layers of the storage architecture -- file systems, volume managers, and disks -- has been exceedingly simple. Read a block/write a block. Nothing more. That had its advantages, but also its drawbacks. You couldn't say, for example, "The following five blocks must all be written together or not at all." And there was no way to let the software optimize performance for you, on the fly, as ZFS can. "If all your file systems were getting exactly equal loads, then the old model would be fine," Bonwick says, "but usually load is spikier than that, and what might be happening is that these five file systems have no load at all, and one file system is transferring some ginormous file." So Bonwick and Ahrens came up with the centerpiece of ZFS: the data management unit, or DMU. "The interesting thing about it is that the interface it presents to the world is not files and directories or anything of the sort. It's objects and transactions on those objects," Moore says. "If a rename comes in, for example, the DMU can describe the work to do: 'As part of this transaction, I want to write to the source directory to remove reference to the file; I want to write to the destination directory to add a reference to the file; I want to write to the metadata file to record the time of the move; and then I want to close this transaction.' It's not some random stream of writes. The DMU bundles the transactions into a transaction group -- a whole bunch of work that either has to succeed or fail as a whole. Just like a database," Moore explains. "Now that we have a semantically rich description of what needs to happen, we can say, 'You know what? First of all, if I can put this data anywhere I want, then the best performance would be to put it here. So let me do that. Furthermore, it would be nice to order the I/Os like this so it will go faster on the disks. 
Well, let me check the transaction constraints on those I/Os to see if I'm allowed to do that.' It has all the information it needs to do the best job of optimization."
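Moore's rename example can be sketched as a group of writes that succeeds or fails as a whole. The code below is an illustration under stated assumptions, not the DMU's interface: ObjectStore, commit_group, and rename are hypothetical names, and a deep copy of an in-memory dictionary stands in for whatever mechanism the real system uses to keep a half-finished group from becoming visible.

```python
import copy


class ObjectStore:
    """Toy object store whose state only changes through whole transaction groups."""

    def __init__(self):
        self.objects = {}   # object name -> dict of entries

    def commit_group(self, transactions):
        """Apply every write in every transaction, or none of them."""
        staged = copy.deepcopy(self.objects)   # work on a private copy
        try:
            for txn in transactions:
                for write in txn:
                    write(staged)
        except Exception:
            return False          # live state was never touched
        self.objects = staged     # one switch-over makes the whole group visible
        return True


def rename(src_dir, dst_dir, name, when):
    """Describe a rename as three writes, following the example in the text."""
    moved = {}

    def remove_source_entry(objs):
        moved["ref"] = objs[src_dir].pop(name)

    def add_destination_entry(objs):
        objs[dst_dir][name] = moved["ref"]

    def record_time(objs):
        objs["metadata"][name] = when

    return [remove_source_entry, add_destination_entry, record_time]


store = ObjectStore()
store.objects = {"src": {"report.txt": 42}, "dst": {}, "metadata": {}}

ok = store.commit_group([rename("src", "dst", "report.txt", when=1000)])
# This rename fails halfway (no "missing" directory), so nothing is applied.
bad = store.commit_group([rename("dst", "missing", "report.txt", when=2000)])
```

After the failed second group, the file is still listed in "dst" even though the staged copy had already removed it: the partial work was discarded along with the staging copy.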
Bonwick and Moore both talk to customers frequently and say they have been amazed and gratified by the reception ZFS has received. "It's actually somewhat astounding," Moore says. "I knew from working at a storage startup that the world is feeling a lot of pain in terms of keeping all their data intact and up to date, but I didn't realize to what degree. Whenever someone hears about ZFS, the initial reaction, almost universally, is, 'Wow, I've got to get that and use it right now. That would solve...' and they name whatever their problem is." "We were all toiling in anonymity in our caves for a long time, and it was really an act of faith. When you take a new approach to an old problem, there's no guarantee it'll resonate in the marketplace. You never know, really, until you ship it," Bonwick says. "But what we've found is, by and large, the problems our customers have and the problems we have ourselves are often the same problems," he adds. "When Tim Marsland and I were having that initial conversation, we were viewing the problem not as engineers but as customers." Bonwick and Moore have had fun taking the Jeff and Bill Show on the road. "We've never had to sell ZFS. All we've had to do is explain it," Bonwick observes. "And it's fun to watch the dynamics of the room change. We often start out with a room full of crossed arms and a 45-minute time slot. Two hours later everyone's leaning forward in their seats, bug-eyed and asking questions."