Computer, Heal Yourself

How a Distinguished Engineer turned buzz into reality.

By Al Riske

14.Nov.06 -- The predictive self-healing properties in the Solaris 10 operating system were born of a frustration that anyone who has ever used a computer can understand. It's the feeling you get when an error message appears on your screen and you don't know what it means or what you're supposed to do about it. Should you jam your foot through the monitor or simply throw it through the nearest window?


Enter Mike Shapiro.

A Distinguished Engineer in Solaris kernel development, Shapiro led the effort to create Sun's software approach to reliability, availability, and serviceability. In Solaris 10, he worked on the design and engineering of the fault-management and service-management features that make up self-healing, and on DTrace, with longtime pal Bryan Cantrill.

The term "self-healing" has been bandied about pretty freely in the computer industry, and Shapiro recalls seeing billboards and advertisements for unbreakable this and self-healing that during the five-plus years he worked on the project.

"The reality was, if you looked behind the billboard, it turned out to be just the same stuff as before, with maybe a little better service plan or a little better error messages," he says. "No one had really done anything that seemed particularly innovative in this area in terms of really changing the experience for failures."

Meanwhile, Shapiro and his colleagues were doing the hard work of taking "self-healing" from marketing buzzword to reliable reality.

The story begins in 1999 with an e-cache issue that was bedeviling Sun customers that fall. Shapiro became part of a cross-functional team of hardware and software experts that was called in to figure out what to do.

The solution: More sophisticated error handling in the operating system and a clever piece of software that functioned as a kind of scrubber to keep errors from building up in the cache (a function now hardwired into the microprocessor).

Later, when work began on Solaris 10, it occurred to Shapiro that they had been approaching the problem all wrong.

"Customers were seeing these error messages, and of course no one could understand them. Our field couldn't understand them. None of the people who wrote them could even understand them because they'd all forgotten about them. So we went through this very detailed exercise of improving the error messages," he recalls. "After we did that, we had this team, no more than five people, who could interpret all these newfangled messages, because they really exposed a lot of low-level details of the processor."

If a customer had a problem, the team would comb through a stream of messages to figure out what was happening inside the system.

"The thing that started occurring to me was -- and this sounds obvious but at the time was a radical idea -- 'Why do I have five people in a room going over this stuff?'" Shapiro says. "I see there's a whole bunch of complex logic that's in the heads of these people and they're just taking these messages and applying that logic over and over again. In other words, there's an algorithm."

The way Shapiro sees it, an error message is the computer equivalent of a sore throat or a fever.

"Those are symptoms you can observe and describe, but they're not the real problem," he explains. "If you go to the doctor and say, 'I have a headache and my throat hurts and my eyes are red,' your doctor's job is to look at those symptoms and root cause that to the underlying problem."


We monitor the health of computer systems with various detectors in the hardware and bits of code that test the integrity of data, but, as in the human body, there are many problems that have the same symptoms. Conversely, many different symptoms may have the same root cause.

"You can have failure modes where one underlying failure can manifest itself as tens, hundreds, or even thousands of symptoms that propagate up through the stack of software. The device driver sees this, the filesystem sees that ... so you get this cascading effect," Shapiro says. "If you go to a file and you see a hundred error messages, that doesn't tell you, Was there one problem? Were there 100 problems? Were there 10? Three? It doesn't tell you any of that, because you have to figure out some causal relationship between those things. What do they mean?"

So the big idea behind self-healing is shockingly simple.

"Instead of having software spit out error messages, we take those error messages and turn them into, not messages intended for people, but messages intended for another piece of software," Shapiro says. "The stream of error messages becomes a stream of telemetry, and then we have other pieces of software called diagnosis engines, and they're like the doctors. They know that for some set of problems that exist in the world, you might see this set of symptoms and they might have these interrelationships. It's like having a little expert system built into the computer that knows how to match those symptoms to a problem."
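To make the idea concrete, here is a minimal sketch in Python of what "matching symptoms to a problem" can look like. It is not Sun's code and does not use any Solaris interface: the event fields, the rule table, the advice strings, and the `diagnose` function are all hypothetical, invented only to show how a flood of error telemetry can be boiled down to a single diagnosis.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One telemetry record (hypothetical shape, not a Solaris format)."""
    reporter: str  # layer that saw the symptom, e.g. "driver"
    symptom: str   # what it saw, e.g. "io-timeout"
    resource: str  # what the symptom points at, e.g. "disk1"


# Hypothetical rule table: each known fault and the symptoms it can produce.
RULES = {
    "disk-failing": {"io-timeout", "checksum-mismatch"},
    "memory-faulty": {"correctable-ecc", "uncorrectable-ecc"},
}

# Hypothetical advice shown to the administrator for each fault.
ADVICE = {
    "disk-failing": "Replace the disk.",
    "memory-faulty": "Schedule a memory module replacement.",
}


def diagnose(events):
    """Collapse many symptom events into one diagnosis per resource."""
    symptoms_by_resource = defaultdict(set)
    for event in events:
        symptoms_by_resource[event.resource].add(event.symptom)

    diagnoses = []
    for resource, symptoms in sorted(symptoms_by_resource.items()):
        for fault, known_symptoms in RULES.items():
            if symptoms & known_symptoms:
                diagnoses.append((resource, fault, ADVICE[fault]))
                break  # first matching rule wins in this toy version
    return diagnoses


# One failing disk seen by two layers of the stack: 100 events, 1 problem.
events = (
    [ErrorEvent("driver", "io-timeout", "disk1")] * 60
    + [ErrorEvent("filesystem", "checksum-mismatch", "disk1")] * 40
)
for resource, fault, advice in diagnose(events):
    print(f"{resource}: {fault}. {advice}")
```

A hundred events go in and one line of advice comes out. A real diagnosis engine has to weigh timing, topology, and competing explanations, which this toy version ignores entirely.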

Simple idea, complex execution. But now the user experience is radically changed.

"Instead of seeing a whole bunch of crazy error messages, you see one message that gives you a description you can understand and tells you what's really important to you, which is what to do about it. Should you pull out a disk? Which one? Do you need to replace a CPU? Order a new part? All that kind of thing," he says.

But of course Shapiro and his colleagues -- Cindi McGuire, Gavin Maltby, Stephen Hahn, Liane Praza, and other engineers from the Solaris team -- didn't stop there.

"The next part is to have the computer not only tell you what the problem is but to take automated action. To do something. For example, if a piece of software seems to have a bug in it, maybe we can't figure out exactly where the bug is, but we can identify the software component that seems to have the problem and we can automatically restart it for you," Shapiro says.

Though predictive self-healing can't fix hardware problems, it can work around them -- and that can be almost as effective, given the memory capacity and multicore, multithread processing power of systems today.

"It's like if you have two gigabytes of DRAM, do you care if I take away 1 bit? Not really. If I take away half you'll care, or 25 percent. But if we get down to 1 percent, or .01 percent, you would never even notice that," Shapiro says.

That's how sophisticated the diagnosis engines in Solaris 10 are becoming -- and they continue to get better with each update of the operating system, he adds.

In fact, Shapiro and colleagues recently published a paper at the Dependable Systems and Networks conference reporting that the self-healing diagnosis and response for memory failures in Solaris 10 can reduce yearly system downtime by 30 to 50 percent.
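Shapiro's DRAM arithmetic is easy to check. The sketch below, in Python, assumes the system works around a bad memory location by taking a whole page out of service, and it assumes an 8 KB page size. Both are illustrative assumptions, as are the function names; none of this comes from the Solaris source.

```python
TOTAL_BYTES = 2 * 1024**3  # the "two gigabytes of DRAM" from the quote
PAGE_BYTES = 8 * 1024      # assumed page size, for illustration only


def percent_lost(retired_pages):
    """Percentage of total memory given up after retiring some pages."""
    return 100.0 * retired_pages * PAGE_BYTES / TOTAL_BYTES


def pages_within_budget(budget_percent):
    """Most pages that can be retired without exceeding the budget."""
    total_pages = TOTAL_BYTES // PAGE_BYTES
    return int(total_pages * budget_percent / 100.0)


print(f"one page retired: {percent_lost(1):.5f}% of memory")
print(f"pages that fit in a 0.01% budget: {pages_within_budget(0.01)}")
print(f"pages that fit in a 1% budget: {pages_within_budget(1.0)}")
```

Under these assumptions, a single retired page costs far less than the 0.01 percent Shapiro says nobody would notice, and even a 1 percent budget leaves room for thousands of retirements.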

The challenge for Sun is convincing customers that this is different.

"I think our competitors did the whole thing a disservice by just making it a bunch of fluff. That resulted in a backlash. Customers became numb to it. 'Yeah, yeah, everyone tells me their servers are self-healing. The reality is I still have five guys in suits running around in the datacenter doing the same things they used to, so how is this any different?' Well, the approach we've taken is much more data-driven and reality-driven. When we've talked about our features, we've talked about them by showing them to people," Shapiro says.

During presentations to customers at Sun's Executive Briefing Center in Menlo Park, California, Shapiro begins with a brief overview, and then ...

"I bring up a real server and say, 'You might be wondering, How do we actually test this? Well, we have ways of injecting real failures into the machine. One of the things we do is, we actually have software that can basically force real problems into the hardware, and we've worked with the hardware people here at Sun designing those interfaces,'" he says.

The beauty here is in making something so complex appear so simple. The processor fails and Solaris takes action, as customers watch.

"That changes the whole nature of the discussion because as soon as you do that, they get it," Shapiro says. "This is not a billboard. This is real."

Mike Shapiro

Title: Distinguished Engineer, Sun Microsystems.

Quote: "Design is just as much about the stuff that's left out as the stuff that's put in."

Background: Led the effort to design and build the Sun architecture for predictive self-healing and fault management. Co-creator of DTrace. Author of the DTrace compiler, D programming language, kernel panic subsystem, fmd(1M), mdb(1M), smbios(1M), dumpadm(1M), pgrep(1), pkill(1), and numerous enhancements to the /proc filesystem, core files, crash dumps, and Solaris hardware error handling.

Education: Bachelor's and master's degrees in computer science from Brown University.

Honors: Three-time winner of the Sun Chairman's Award for Innovation -- in 2001 for his work on Solaris technology to recover from processor and memory faults, in 2004 for his work on DTrace, and in 2005 for his work on predictive self-healing for SPARC and AMD Opteron systems. InfoWorld Innovation Award for DTrace and Predictive Self-Healing in 2005. Technology Innovation Award for DTrace from The Wall Street Journal in 2006.

Patents: 14 (pending).

Childhood Ambition: "To be a software engineer ... My dad taught me how to program, so that was something I was interested in pretty much as long as I can remember."

Hobbies: Cooking. Theater lighting. (His wife is a drama teacher.)

Favorite Food: "Anything from Italy. You can't go wrong."

Pet Peeve: Pessimism.

Last Book Read: Ghost Wars, by Steve Coll.

Little-Known Fact: He and his father are in a photo (featured on a DVD of Boston sports highlights) of fans at the last Celtics game ever played at Boston Garden.

Favorite Movie: The Maltese Falcon.

Favorite Song: "Anything by Bob Dylan."

First Job: "My friend and I had a car-washing service."

Perfect Day: "My favorite ways to spend time are, first of all, with my wife -- just relaxing, having a meal, and talking together. I also really like to have quiet time. If I can have some time with my family and some time thinking about something I'm passionate about, working on a hard problem, that's a great day for me."

What Brought Him to Sun: "I wanted to be an operating-system engineer and it was very obvious to me that the most innovative work in systems design at the OS level was happening at Sun, so I wanted to be a part of that."

What Inspires Him: "I'm constantly inspired by anyone who brings passion to what they do -- and that doesn't have to be engineers. I'm equally inspired by passionate actors, mathematicians, artists. Being a software engineer is very much a cross-disciplinary activity because there's an element of art, language, mathematics ... so if you listen to people who are passionate and look around for ideas -- those are the kinds of things that really inspire me."
