Contrarian Minds: Kenny Gross
Looking for Trouble
By Al Riske
7.Oct.04 -- To understand the value of the latest innovation from Sun physicist Kenny Gross and his team in San Diego, California, you first need to understand the nature of the problem it solves. In the computer industry, the problem is common enough that technicians describe it with a standard abbreviation, NTF. It stands for "no trouble found," but it doesn't mean everything is okay. On the contrary, it means that the server has encountered a glitch, but nobody can figure out why. That's not good.
"It might be a million-dollar machine, and the service engineers can't find a problem with it," Gross says. "Customers don't like to hear that the root cause cannot be determined."
In fact, a year ago, that was exactly what happened at a giant European bank -- a major account Sun wanted to keep happy -- when Gross and team were called in. Their innovation, dubbed the Continuous System Telemetry Harness, was an extension of the work Gross had done in the nuclear power industry, where breakdowns that result in NTF diagnoses are simply not an option.
There, Gross was the lead developer of an award-winning method of statistical pattern recognition known as the multivariate state estimation technique, or MSET. The technique proved so effective in nuclear power plants that it has since been deployed on jet airliners. Even NASA's space shuttles use it. So why not large computer servers? Gross and his team (Keith Whisnant, Aleksey Urmanov, Kalyan Vaidyanathan, Sajjit Thampy, and telemetry co-inventor Larry Votta) set out to combine telemetry with advanced pattern recognition in high-end servers -- to monitor temperatures, voltages, currents, and a variety of performance metrics so they could see trouble coming from miles away. But they ran into a surprising response: "It can't be done." They were told there wouldn't be enough bandwidth to monitor signals from hundreds of sensors in the server. They were told that even if they harnessed all the signals, the samples would not be in synchrony, and thus, common statistical methods wouldn't be able to make sense of them. "We approached telemetry and sensory experts outside Sun who monitor signals from lots of types of process plants and engineering systems, but none had ever tried to monitor computer system internals," Gross recalls. "The reasons cited by various skeptics in academia as well as in technical society groups usually reduced to a circular argument: 'Since no one in the computing industry has ever implemented real-time telemetry and pattern recognition, there must be a reason the concept does not apply to computer servers.'" Gross didn't let the skepticism stop him.
Here it's important to note how things worked in the computer industry before the telemetry harness. If a particular component -- say a power supply, capacitor, or fan motor -- started to fail at an increasing rate in customers' data centers around the world, weeks might go by before the mechanism causing the problem could be recognized and its root cause identified. This would be particularly true for problems that were intermittent in nature, or that gave only very subtle or ambiguous evidence that something might be wrong. By the time a repair engineer could identify a suspect module and bring it back to a service center to test, the problem would disappear (kind of like that noise your car makes for you but not for your mechanic) and the module would be labeled NTF. "In the past, many such problems were initially uncovered simply by word-of-mouth among service engineers," Gross says. "The repair folks on the front lines, after replacing the same types of module several times for different customers, realize that something is amiss here. Then the concern works its way up the engineering chain."
Fortunately, Sun already had a long-standing practice of embedding temperature, voltage, and current sensors throughout its servers. The sensors were put there so a service engineer could sit down at the console and see how hot some module is getting or how low a voltage is at the moment. "But in the past, nobody harvested the signals," Gross says. "The innovation from my team was to start sampling those signals continuously with a new software tool and then use the correlations in those signals as a very sensitive diagnostic probe. The telemetry harness, which requires no hardware modifications at all, now becomes the EKG system monitoring the health of the server." The pattern recognition software monitors the various signals, learns the normal patterns among those signals, and warns you if anything out of the ordinary develops. And it keeps a month-long history of signal patterns (the equivalent of a black-box flight recorder on an airliner) to enable quick and accurate root-cause analysis. The "black-box" signals have helped to eliminate sources of NTFs in servers. More than that, the early warning capability of the telemetry harness enables service engineers to proactively replace at-risk components -- before the customer experiences any interruption in service. Imagine what such early warning software would mean in, say, financial services, where the cost of downtime can be as much as $6.45 million an hour in lost business and lost productivity.
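
The general idea described above (sample correlated signals, learn how they normally move together, flag departures, and keep a rolling history) can be illustrated with a small sketch. This is not MSET and not Sun's telemetry harness: it swaps in ordinary least-squares regression as a stand-in for the pattern recognition step, and every name in it (`CorrelationMonitor`, `make_demo_signals`, the synthetic temperature/voltage/current relationships, the 4-sigma threshold) is a hypothetical choice made for illustration.

```python
from collections import deque

import numpy as np


def make_demo_signals(n, seed=0):
    """Synthetic (assumed) telemetry: columns are temperature, voltage, current.

    All three are driven by a hidden load level, so they are correlated.
    """
    rng = np.random.default_rng(seed)
    load = rng.uniform(0.0, 1.0, n)
    temp = 40.0 + 20.0 * load + rng.normal(0.0, 0.3, n)
    volt = 12.0 - 0.1 * load + rng.normal(0.0, 0.01, n)
    curr = 5.0 + 10.0 * load + rng.normal(0.0, 0.1, n)
    return np.column_stack([temp, volt, curr])


class CorrelationMonitor:
    """Toy residual monitor: predicts each signal from the other signals."""

    def __init__(self, history_len=1000, threshold_sigmas=4.0):
        # Bounded buffer of recent samples, a stand-in for the "black box".
        self.history = deque(maxlen=history_len)
        self.threshold_sigmas = threshold_sigmas
        self.coefs = []
        self.resid_std = []

    @staticmethod
    def _design(samples, j):
        # Every signal except j, plus a constant column for the intercept.
        others = np.delete(samples, j, axis=1)
        return np.hstack([others, np.ones((samples.shape[0], 1))])

    def fit(self, training):
        """Learn the normal relationships from known-good samples."""
        training = np.asarray(training, dtype=float)
        self.coefs, self.resid_std = [], []
        for j in range(training.shape[1]):
            X = self._design(training, j)
            beta, *_ = np.linalg.lstsq(X, training[:, j], rcond=None)
            resid = training[:, j] - X @ beta
            self.coefs.append(beta)
            self.resid_std.append(resid.std() + 1e-12)  # avoid divide-by-zero

    def check(self, sample):
        """Record one sample and return indices of signals that look abnormal."""
        row = np.asarray(sample, dtype=float).reshape(1, -1)
        self.history.append(row[0].copy())
        flagged = []
        for j, beta in enumerate(self.coefs):
            predicted = (self._design(row, j) @ beta)[0]
            z = abs(row[0, j] - predicted) / self.resid_std[j]
            if z > self.threshold_sigmas:
                flagged.append(j)
        return flagged


if __name__ == "__main__":
    monitor = CorrelationMonitor(history_len=10)
    monitor.fit(make_demo_signals(500))
    print(monitor.check([50.0, 11.95, 10.0]))  # consistent with training data
    print(monitor.check([58.0, 11.95, 10.0]))  # temperature out of pattern
```

In this sketch a reading is judged against what the other signals predict for it, not against a fixed limit, which is why a temperature that would be unremarkable under heavy load can still be flagged at moderate load. Because the signals are correlated, one misbehaving sensor may also trip the residuals of its neighbors; separating the faulty signal from the merely affected ones is part of what a real diagnostic tool would have to handle.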
Gross and his team started experimenting on large servers at Sun's Physical Sciences Research Center in San Diego, initially as a side project. Although there were plenty of skeptics, both inside and outside the company, Sun's Jud Cooley (the senior director above Gross' team) and CTO Greg Papadopoulos recognized from early experimental successes the value that continuous telemetry might bring. So they began to promote the team's growing portfolio of telemetry and pattern-recognition inventions to various software and hardware design teams across the organization. Then came the chance to show what the telemetry harness could do for a real customer -- a real unhappy customer. This was a major bank and, Gross admits, he was pretty nervous. He didn't want to be the guy who made hundreds of automated teller machines across Europe go blank. Although the telemetry harness had worked very well on internal Sun servers, he was about to install the software on a bank's production servers -- the servers that power all of the bank's day-to-day business activity.
This customer wasn't taking "no trouble found" for an answer. "Intermittent problems and NTFs are a big customer dissatisfier in the computing industry," Gross says. "If you buy a new car and your engine stalls out, the last thing you want to hear is that the dealer has no idea what's causing the problem.
"High-end servers can cost $1 million to $3 million each. Of course the customer is upset if a machine crashes, but they're way more upset if you then say, 'Well, we don't have any clue what went wrong, so let's just start it up again and see what happens.' Customers have a lot more confidence if you can point to the exact component that failed, tell them why it failed, and say, 'We've replaced the faulty component, so you're as good as new now.'"
And that's just what the telemetry harness achieved for this important customer. "We took the harness to the customer's data center, put it on all their machines, and it solved the problem," Gross says in his soft-spoken, matter-of-fact way. "The bank was extremely happy. Their execs asked that the telemetry harness be left on all their production servers permanently."
Although the initial version installed on the bank's machines required Gross to look at the signatures of the system variables to spot anomalies by eyeball, the newer version uses automated pattern recognition. Just like in the nuclear plants.
That success changed things for the bank and for Sun. For Sun, it sped the delivery of the Continuous System Telemetry Harness, a version of which is now available on Sun's high-end and midrange servers and will soon be extended to the entire server product line. "Nobody else has this. Just Sun. And we have a thick portfolio of patents on it," Gross says.
And the bank? "They sent a letter last month saying they've gone to five-nines [99.999 percent] availability now," Gross says.