
Contrarian Minds: Kenny Gross

Looking for Trouble

By Al Riske

7.Oct.04 -- To understand the value of the latest innovation from Sun physicist Kenny Gross and his team in San Diego, California, you first need to understand the nature of the problem it solves.

In the computer industry, the problem is common enough that technicians describe it with a standard abbreviation, NTF. It stands for "no trouble found," but it doesn't mean everything is okay. On the contrary, it means that the server has encountered a glitch, but nobody can figure out why.

That's not good.

"It might be a million-dollar machine, and the service engineers can't find a problem with it," Gross says. "Customers don't like to hear that the root cause cannot be determined."

In fact, a year ago, that was exactly what happened at a giant European bank -- a major account Sun wanted to keep happy -- when Gross and his team were called in.

Their innovation, dubbed the Continuous System Telemetry Harness, was an extension of the work Gross had done in the nuclear power industry, where breakdowns that result in NTF diagnoses are simply not an option.

There, Gross was the lead developer of an award-winning method of statistical pattern recognition known as the multivariate state estimation technique, or MSET.

The technique proved so effective in nuclear power plants that it has since been deployed on jet airliners. Even NASA's space shuttles use it.

So why not large computer servers?

Gross and his team (Keith Whisnant, Aleksey Urmanov, Kalyan Vaidyanathan, Sajjit Thampy, and telemetry co-inventor Larry Votta) set out to combine telemetry with advanced pattern recognition in high-end servers -- to monitor temperatures, voltages, currents, and a variety of performance metrics so they could see trouble coming from miles away. But they ran into a surprising response: "It can't be done."

They were told there wouldn't be enough bandwidth to monitor signals from hundreds of sensors in the server.

They were told that even if they harnessed all the signals, the samples would not be in synchrony, and thus, common statistical methods wouldn't be able to make sense of them.
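The synchrony objection can be made concrete with a small sketch. The article does not say how Gross's team handled unsynchronized samples, so what follows is only one common approach, assumed here for illustration: linearly interpolating each sensor's readings onto a shared time grid so that ordinary statistical methods can compare them. The function name `resample` and all sensor readings are hypothetical.

```python
from bisect import bisect_right

def resample(times, values, grid):
    """Linearly interpolate (times, values) onto the timestamps in grid.

    times must be sorted ascending. Grid points outside the sampled
    range are clamped to the nearest endpoint value.
    """
    out = []
    for t in grid:
        if t <= times[0]:
            out.append(values[0])
        elif t >= times[-1]:
            out.append(values[-1])
        else:
            i = bisect_right(times, t)  # times[i-1] <= t < times[i]
            t0, t1 = times[i - 1], times[i]
            v0, v1 = values[i - 1], values[i]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return out

# Hypothetical readings: a temperature sensor polled every 2 seconds
# and a voltage sensor polled at irregular times.
temp_t, temp_v = [0.0, 2.0, 4.0], [40.0, 42.0, 44.0]
volt_t, volt_v = [0.5, 1.5, 3.5], [12.0, 12.2, 11.8]

grid = [1.0, 2.0, 3.0]  # shared timestamps for both signals
temp_on_grid = resample(temp_t, temp_v, grid)
volt_on_grid = resample(volt_t, volt_v, grid)
```

Once both signals sit on the same grid, each row of the combined data describes the server at a single instant, which is the form that correlation-based methods expect.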

"We approached telemetry and sensory experts outside Sun who monitor signals from lots of types of process plants and engineering systems, but none had ever tried to monitor computer system internals," Gross recalls.

"The reasons cited by various skeptics in academia as well as in technical society groups usually reduced to a circular argument: 'Since no one in the computing industry has ever implemented real-time telemetry and pattern recognition, there must be a reason the concept does not apply to computer servers.'"

Gross didn't let the skepticism stop him.

"If I buy a new car and my engine stalls out, the last thing I want to hear is that the dealer has no idea what is causing the problem."

Kenny Gross
Senior Physicist, Scalable Systems Group
Sun Microsystems

 

Here it's important to note how things worked in the computer industry before the telemetry harness.

If a particular component -- say a power supply, capacitor, or fan motor -- started to fail at an increasing rate in customers' data centers around the world, weeks might go by before the mechanism causing the problem could be recognized and its root cause identified. This would be particularly true for problems that were intermittent in nature, or that gave only very subtle or ambiguous evidence that something might be wrong. By the time a repair engineer could identify a suspect module and bring it back to a service center to test, the problem would disappear (kind of like that noise your car makes for you but not for your mechanic) and the module would be labeled NTF.

"In the past, many such problems were initially uncovered simply by word-of-mouth among service engineers," Gross says. "The repair folks on the front lines, after replacing the same types of module several times for different customers, realize that something is amiss here. Then the concern works its way up the engineering chain."

Fortunately, Sun already had a long-standing practice of embedding temperature, voltage, and current sensors throughout its servers. The sensors were put there so a service engineer could sit down at the console and see how hot some module is getting or how low a voltage is at the moment.

"But in the past, nobody harvested the signals," Gross says. "The innovation from my team was to start sampling those signals continuously with a new software tool and then use the correlations in those signals as a very sensitive diagnostic probe. The telemetry harness, which requires no hardware modifications at all, now becomes the EKG system monitoring the health of the server."

The pattern recognition software monitors the various signals, learns the normal patterns among those signals, and warns you if anything out of the ordinary develops. And it keeps a month-long history of signal patterns (the equivalent of a black-box flight recorder on an airliner) to enable quick and accurate root-cause analysis. The "black-box" signals have helped to eliminate sources of NTFs in servers.
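The idea of learning normal patterns among correlated signals and flagging departures from them can be illustrated with a deliberately simple sketch. This is not MSET and not Sun's implementation; it assumes just two signals related by a straight line, fits that line on healthy data, and raises a warning when a new reading strays too far from it. A bounded buffer stands in for the "black-box" history. The class name `PairMonitor`, the threshold, the buffer size, and all readings are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev

class PairMonitor:
    """Learn a linear relation y ~ a*x + b from healthy data, then flag outliers."""

    def __init__(self, xs, ys, threshold=4.0, history=1000):
        # Ordinary least-squares fit on the training (healthy) samples.
        mx, my = mean(xs), mean(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        self.a = sxy / sxx
        self.b = my - self.a * mx
        # Spread of the training residuals defines what "normal" looks like.
        residuals = [y - (self.a * x + self.b) for x, y in zip(xs, ys)]
        self.sigma = pstdev(residuals)
        self.threshold = threshold
        # Bounded history: the oldest entries drop off automatically.
        self.history = deque(maxlen=history)

    def observe(self, x, y):
        """Record a reading and return True if it looks anomalous."""
        residual = y - (self.a * x + self.b)
        self.history.append((x, y, residual))
        return abs(residual) > self.threshold * self.sigma

# Hypothetical healthy data: CPU load (percent) versus module temperature (C).
load = [10, 20, 30, 40, 50, 60]
temp = [35.1, 39.9, 45.2, 49.8, 55.1, 59.9]
monitor = PairMonitor(load, temp)

normal_alarm = monitor.observe(35, 47.4)  # close to the learned pattern
fault_alarm = monitor.observe(35, 55.0)   # far hotter than the load explains
```

The point of the sketch is that a temperature of 55 C is unremarkable on its own; it only stands out because it is inconsistent with the load reading taken at the same time. That is what makes correlations among signals a more sensitive probe than fixed per-sensor limits.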

More than that, the early warning capability of the telemetry harness enables service engineers to proactively replace at-risk components -- before the customer experiences any interruption in service.

Imagine what such early warning software would mean in, say, financial services, where the cost of downtime can be as much as $6.45 million an hour in lost business and lost productivity.


Gross and his team started experimenting on large servers at Sun's Physical Sciences Research Center in San Diego, initially as a side project.

Although there were plenty of skeptics, both inside and outside the company, Sun's Jud Cooley (the senior director above Gross' team) and CTO Greg Papadopoulos recognized from early experimental successes the value that continuous telemetry might bring. So they began to promote the team's growing portfolio of telemetry and pattern-recognition inventions to various software and hardware design teams across the organization.

Then came the chance to show what the telemetry harness could do for a real customer -- a real unhappy customer.

This was a major bank and, Gross admits, he was pretty nervous. He didn't want to be the guy who made hundreds of automated teller machines across Europe go blank. Although the telemetry harness had worked very well on internal Sun servers, he was about to install the software on a bank's production servers -- the servers that power all of the bank's day-to-day business activity.

This customer wasn't taking "no trouble found" for an answer.

"Intermittent problems and NTFs are a big customer dissatisfier in the computing industry," Gross says. "If you buy a new car and your engine stalls out, the last thing you want to hear is that the dealer has no idea what's causing the problem.

"High-end servers can cost $1 million to $3 million each. Of course the customer is upset if a machine crashes, but they're way more upset if you then say, 'Well, we don't have any clue what went wrong, so let's just start it up again and see what happens.' Customers have a lot more confidence if you can point to the exact component that failed, tell them why it failed, and say, 'We've replaced the faulty component, so you're as good as new now.'"

And that's just what the telemetry harness achieved for this important customer.

"We took the harness to the customer's data center, put it on all their machines, and it solved the problem," Gross says in his soft-spoken, matter-of-fact way. "The bank was extremely happy. Their execs asked that the telemetry harness be left on all their production servers permanently."

Although the initial version installed on the bank's machines required Gross to look at the signatures of the system variables to spot anomalies by eyeball, the newer version uses automated pattern recognition. Just like in the nuclear plants.

That success changed things for the bank and for Sun.

For Sun, it sped the delivery of the Continuous System Telemetry Harness, a version of which is now available on Sun's high-end and midrange servers and will soon be extended to the entire server product line.

"Nobody else has this. Just Sun. And we have a thick portfolio of patents on it," Gross says.

And the bank?

"They sent a letter last month saying they've gone to five-nines [99.999 percent] availability now," Gross says.
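For a sense of scale, the arithmetic behind "five nines" is easy to work through. The sketch below converts an availability fraction into allowed downtime per year and, purely as an illustration, prices that downtime at the upper-end figure of $6.45 million an hour that the article cites for financial services. It makes no claim about the bank's actual costs, and the function and variable names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years

def downtime_minutes_per_year(availability):
    """Minutes of downtime per year permitted by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

five_nines = downtime_minutes_per_year(0.99999)  # about 5.26 minutes
three_nines = downtime_minutes_per_year(0.999)   # about 525.6 minutes

# Illustrative only: cost of that downtime at $6.45 million per hour.
COST_PER_HOUR = 6.45e6
five_nines_cost = five_nines / 60 * COST_PER_HOUR  # roughly $565,000
```

At 99.999 percent availability a server may be down for only about five and a quarter minutes in a whole year, compared with nearly nine hours at 99.9 percent.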


Kenny Gross

Title: Senior Physicist, Scalable Systems Group at Sun.

Job: Improving the reliability and quality of service of enterprise computing systems.

Claim to Fame: Coupling telemetry with advanced pattern recognition in computer servers through the Continuous System Telemetry Harness.

Quote: "Nobody else has this. Just Sun. And we have a thick portfolio of patents on it."

Results: Five nines (99.999 percent) availability for the product's first customer. Ongoing reduction in costly "no trouble found" events across the midrange and high-end server lines.

Honors: 2004 Chairman's Award for Innovation, in recognition of his innovations in telemetry and advanced pattern recognition. 1998 R&D 100 Award for MSET, the multivariate state estimation technique, an advanced method of statistical pattern recognition used in safety-critical applications.

Influence: Nuclear power plants, jet airliners, and even NASA's space shuttles use techniques that Gross and his nuclear research team pioneered back in the 1990s. Now, so do Sun servers.

Patents: 56 issued and pending.

Education: Doctoral degree in nuclear engineering from the University of Cincinnati.

Background: Came to Sun in 2000 from Argonne National Laboratory, where he was a manager and principal investigator for 23 years, developing a variety of statistical, instrumentation, and pattern-recognition innovations for improving the reliability of safety-critical systems for commercial nuclear and aerospace applications.

What Brought Him to Sun: "It was the excitement of being in an industry and a company where, if we can apply new innovations to make positive things happen for our customers, we all gain and share the rewards."

Pet Peeve: Layers of bureaucracy (after 20-plus years in a government laboratory). ("There's extremely little bureaucratic impedance to technological progress here at Sun.")

Little-Known Fact: Enjoys studying applied probability and statistics in casinos. ("Thanks to my depth of knowledge in statistics, I lose much more slowly than a lot of common casino-goers.")

Favorite Food: Bone-in rib-eye steak. Medium well, please.

Last Book Read: The History of Rasselas, by Dr. Samuel Johnson.

Hobbies: Enjoys studying classical Newtonian physics on a billiards table and non-Newtonian physics in a racquetball court.

What He Wanted to Be When He Grew Up: A physicist. ("I recognized early on that math and science were my least boring subjects, so I decided by the time I was in high school that I wanted to be a physicist.")

Most-Admired Person: Rick Lytel. ("He's one of the country's most brilliant physicists of our time and was a major scientific thought leader throughout his tenure at Sun.")

What Keeps Him Up at Night: The excitement of taking on new challenges.

 
Copyright 1994-2008 Sun Microsystems, Inc.