Solving the Storage Problem
Dick Sillman and the Honeycomb team deliver a radical solution.
Story by Al Riske. Photography by Howard Friedenberg.
28.Jun.05 -- Dick Sillman is the kind of guy you want working in your lab. To him, it's all about getting innovative products out the door and making money for the company. The sooner the better. His latest project, a team effort known as Honeycomb, recently transferred from Sun Labs to the Network Storage Group and should be bringing in revenue for Sun before the end of the year.
The thinking behind Honeycomb: "How do you design a system that not only tolerates component failures, but just assumes things will constantly come and go?" Sillman, who has a knack for simplifying complex concepts, describes the work of his software teammates this way: "They made it so the Honeycomb software always knows which processors and disks are healthy and which are no longer responding. Then, as new tasks come in, it assigns those tasks in parallel. Which makes searching and retrieving files from giant data stores incredibly quick."
What's more, Honeycomb makes it easy for a company to add capacity, because it automatically recognizes new elements and puts them to work. "The software says, 'Hey, look, more capacity. Cool. I know what to do with this,'" Sillman explains.
Another innovation is the way Honeycomb minimizes the probability of data loss. "What we do," Sillman says, speaking as team leader, "is we take an object and slice it into fragments. Then we have a very clever, patented placement algorithm that distributes the fragments throughout the array in such a way that data can always be regenerated, even if multiple components fail."
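To make the slice-and-regenerate idea concrete, here is a minimal sketch in Python. It is not Honeycomb's patented placement algorithm, whose details the article does not give. It assumes a much simpler scheme: an object is cut into k data fragments plus a single XOR parity fragment, so it can survive the loss of any one fragment. Surviving multiple failures, as Sillman describes, would need several parity fragments (a Reed-Solomon style code, for example). The function names `encode`, `regenerate`, and `decode` are hypothetical.

```python
from functools import reduce
from typing import List, Optional, Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> Tuple[List[bytes], int]:
    """Slice data into k equal-size fragments and append one XOR parity fragment.

    Assumption: single parity only, so exactly one lost fragment is recoverable.
    Returns the k + 1 fragments and the original length (needed to strip padding).
    """
    size = -(-len(data) // k)                 # ceiling division
    padded = data.ljust(size * k, b"\0")      # pad so every fragment has the same size
    fragments = [padded[i * size:(i + 1) * size] for i in range(k)]
    fragments.append(reduce(xor_bytes, fragments))  # parity = XOR of all data fragments
    return fragments, len(data)


def regenerate(fragments: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing fragment (marked None) from the survivors."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot recover more than one lost fragment")
    repaired = list(fragments)
    if missing:
        survivors = [f for f in fragments if f is not None]
        # XOR of all surviving fragments equals the missing one.
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired


def decode(fragments: List[bytes], length: int) -> bytes:
    """Join the data fragments (everything except parity) and drop the padding."""
    return b"".join(fragments[:-1])[:length]
```

In use, each of the k + 1 fragments would sit on a different disk. If one disk stops responding, its fragment is marked `None`, and `regenerate` followed by `decode` returns the original object.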
Sillman came to work for Sun, the first time, back in 1985, and put out several products, including the world's first desktop RISC machine, which landed on the cover of the EE Times. In 1991, he became a charter member of Sun Labs, where his first project, ROVER -- short for remote operation of a vast electronic resource -- proved to be a little too far ahead of its time.
"This was before the Internet, before the Web," he says. "The LCD panels were expensive and I couldn't get people outside Sun to understand the value of a network connection back then." The lessons he's learned over the years -- not only at Sun, but when he left to join WebTV and, later, Andes Networks -- have enabled him to develop an effective, if somewhat contrarian, approach to his wide-ranging endeavors. "A successful formula for me is always trying to work on something where I have no idea what we're doing. In other words, no preconceptions. I'm not saying expertise is bad -- on the contrary, it's vital -- but a lack of preconceptions can make it easier to break out and do something radically different." His own approach is to start at the bottom, work his way up, then go back to the bottom in search of a new challenge. "I go into a place where I say, 'I'm going to fail. I will never work in this valley again. How can I possibly have said yes to this job when I have no idea what to do?' But the best engineers love to learn," he says.
"I tell myself I'm gonna learn ... and frankly it's this combination of being so excited about everything is new and kind of scary, but there's also this adrenaline rush of 'Oh, man, I've committed. I've got to deliver this stuff. I'd better get cracking.'"
Sillman also believes the truth lies in the details and wants to know everything he possibly can about a project, good and bad. The challenge for any manager, he says, is that people often think they have no choice but to lie. "What are they going to tell you? 'Well, I've really screwed this up for you, boss.' Nobody is going to say that. They're all going to say, 'Oh, it's great, it's great. I could use a few more people, but other than that we're good here.'
"What I'm noted for as a manager is you're never going to get punished for saying what's true," he says, and then adds with a laugh, "You might get in trouble for making it true, but not for saying it's true. Saying what's true is the best way to stay out of trouble."
Sillman returned to Sun in 2003 to join the nascent Honeycomb project and used what he calls "the virtual corporation technique" to keep the team as small as possible. "Anything that we could explain, we farmed out. Anything we had to think about, hmmm, we kept that for ourselves," he explains. "I said, 'I don't want to bulk up, because a) it will be harder for me to know what's happening and b) I need the flexibility. This is a research project. I don't know where it's going.'"
As great as the software is, Honeycomb really shines as a storage appliance, a highly simplified solution with just three parts -- power supply, disk drive, and processor -- and no single point of failure. "It's all about clustered replication, which means tiles and tiles of computers that know how to share work," Sillman says. One of the basic observations that drove the creation of Honeycomb is that disk drives are cheap and getting cheaper all the time, so why not use that fact to create reliability on the cheap? "The components are so cheap now that we can simply make sure there are a few hundred of everything," Sillman says. He compares it to a string of Christmas tree lights. "If you lose a couple of bulbs over the course of the season, it still looks like a Christmas tree."
With this design, he says, there's no need to rush out and replace a disk or a processor or a power supply that's no longer working. Those things can be replaced weeks or months later -- and so easily, the guy who services your vending machines could do it. In the meantime, the system simply keeps working. Making sure your data will always be there is one thing, Sillman says. It's another to actually mine the data for valuable information. "So one of the contrarian things we did was we said, 'What if we embedded a bunch of horsepower right alongside the drives?' So, instead of trying to move large data sets to supercomputers, you can actually crunch the data from within the storage array," he says. "There are a number of people in the scientific community who are, like, 'Thank you, thank you, that's what we've been asking for for a long time now.'" In terms of performance, the part Sillman and team thought people would care about most was retrieval performance. "We do very well in that, but we do even better on rebuild time and that turns out to be even more important to customers, because they want to close 'the window of vulnerability' as quickly as possible," he says.
"As disk drives get bigger and bigger, now you're losing 500 gigabytes with one drive failure. So when there has been a failure and now it's time to take the parity fragments and regenerate data on the fly, we're able to show a massive improvement. By using multiple Opteron processors to work this problem in parallel, we can rebuild in an hour what used to take eight hours."
In short, Honeycomb takes a comprehensive approach. "With Honeycomb, you take it out of the box, plug it in, and it's ready to store stuff. There's a lot less installing of this and configuring of that. And when you run out of space, you order up some more stuff from Sun, then you turn that stuff on and they all find each other."
In other words, Sillman says, "Many of the things that now keep IT professionals busy for weeks on end -- we've done most of that in the software instead. That's part of the mentality of saying, 'How do we get the cost of ownership down?' Well, we could use computers. There's an idea."