Every System Tells a Story

Getting to know the unique personalities of complex systems.

By Al Riske

22.May.06--If Jon Greaves has learned anything providing remote monitoring and management services to over 160 customers, it's this: No two systems are the same.

Even when they are the same, they're different, says the Sun Distinguished Engineer.

Same hardware. Same software. Different personalities.

The differences have to do with workloads and understanding the customer's business processes -- things like access profiles, available bandwidth, and seasonal variations in the application's usage -- and, as Greaves puts it, "the relationship between the strands of interaction in the system."

The good news is that Sun has the ability to learn the unique personality of any system or application and predict how it will behave.

Although the technique, called "personality mapping," is still in nascent form, it has already proved effective with customers, including an international banking consortium.


"The customer was having a problem with an application crashing every few weeks, without much warning," Greaves recalls.

The application in question happened to be a global messaging and collaboration system using software from one of Sun's competitors.

"We'd find out about it at exactly the same time the customer would. It didn't make anyone happy, least of all our engineers, who are very passionate about proactively fixing issues before they become problems," he says.

So Greaves and team instrumented the system, collected telemetry, and applied some patented algorithms. That gave them the personality map they needed to fix the issue.

"Competitors that are providing managed services are very much in reactive mode. They might do some trending, but trending isn't quite the same as what we're doing here at Sun. This is much more predictive than trending. Trending is really looking at historic values and projecting out a few snapshots beyond the current time,” Greaves says. “For example, you might want to trend CPU usage over time to get a flavor for when you might need to increase capacity. But it becomes very difficult to use that kind of technique to do predictive analyses where there are so many more harmonics and dynamics at play."

The bottom line: The Sun team was able to provide the customer with timely remediation advice.

"We went back to the customer, who looked at us in disbelief when we told them, 'In three days time, unless we do a scheduled, controlled failover of your cluster, it's going to crash. We should talk to the vendor to provide a long-term patch.'"

As Greaves points out, that's a very different conversation than the alarming your-roof-is-on-fire variety.

"It didn't make anyone happy, least of all our engineers, who are very passionate about proactively fixing issues before they become problems."

Jon Greaves
Distinguished Engineer
Sun Microsystems

 

Greaves joined Sun with the acquisition of SevenSpace and is now chief technologist in Sun Managed Operations.

"We provide remote monitoring and management for a wide variety of devices, servers, and applications -- just about anything you can imagine for a customer, from routers to mainframes. One of the challenges for the whole industry is getting more proactive rather than reactive. If something breaks, service providers see it about the same time the customer would and then take corrective action, so it's always kind of just too late," Greaves says.

"What we're doing now is taking some of the modeling techniques Kenny Gross and others at Sun have developed and pairing that with some of the knowledge we have about what are the right things to collect from a system and look for ways of predicting when problems are going to occur."

Greaves has been working closely with Gross, co-inventor of the Continuous System Telemetry Harness, which combines telemetry -- temperatures, voltages, currents, and a variety of performance metrics -- with advanced pattern recognition.

"Kenny was focused on getting telemetry out of hardware. Telemetry for me can come from anywhere -- the hardware, the operating system, the applications. You name an application, I've probably got a customer using it out there in the field," Greaves says.

At times the diversity of systems and data can seem almost overwhelming. Greaves thinks of it this way:

"If you look at the health-care industry, diagnostic tools such as an EKG will tell you something very specific is wrong with a very specific part of your body, which is really fantastic,” he says. “An MRI gives you more of a holistic view of the patient, but it gives you so much data you need to filter out the noise. That's kind of the problem domain I'm working on right now -- trying to understand what good data is and how different streams of good data interact to form these personalities."

"Jon has an uncanny ability to look at collections of ostensibly chaotic signals that constitute performance metrics from datacenters and spot anomalies that can signify the onset of problems in the network."

Kenny Gross
Distinguished Engineer
Sun Microsystems

 

Greaves says he and Gross hit it off immediately, even though they have very different backgrounds. Gross is a physicist with a doctorate in nuclear engineering and a real knack for cooking up complex algorithms. Greaves, on the other hand, was trained as a telephony technician ("poles and holes") in his native England and has no college degree, though he does have extensive experience he describes as "learning in the field, firefighting with customers, and all that good stuff."

"Jon has an uncanny ability to look at collections of ostensibly chaotic signals that constitute performance metrics from datacenters and spot anomalies that can signify the onset of problems in the network, or, in other cases, challenges to the security of those networks," Gross says.

In fact, security was one of the first things the two men discussed, in a one-hour meeting that lasted two.

"Assume you are monitoring a customer's computer systems using a predictive technique and understand the personality of the system," Greaves says. "In some cases, understanding the personality could enable you to make educated guesses about the company's future financial performance."

This potential misuse of telemetry led Greaves and Gross to file a joint patent containing algorithms to camouflage sensitive data while still allowing its use in predicting failures.

"If you can detect the relationship between the strands of interaction in the system, that's when you can build that personality map up and understand what's going to impact you."

Jon Greaves
Distinguished Engineer
Sun Microsystems

 

More to the point, the two men share a passion for keeping systems up and running around the clock.

Gross was the lead developer of the multivariate state estimation technique, or MSET, an award-winning method of statistical pattern recognition. He's also an expert in something called a sequential probability ratio test, or SPRT (pronounced "spurt").
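
For readers who have not met the SPRT, the textbook form of the test (Wald's version, applied here to a Gaussian signal with a known spread) can be sketched as follows. This is not the implementation Gross or Sun uses; the function name, the "normal" and "degraded" labels, and the default error rates are assumptions chosen for illustration. The test accumulates evidence one sample at a time and stops as soon as it can decide between a healthy mean and a degraded one.

```python
import math


def sprt_gaussian(samples, mu0, mu1, sigma, alpha=0.01, beta=0.01):
    """Wald's sequential probability ratio test for a shift in mean.

    mu0 is the healthy mean, mu1 the degraded mean, sigma the known
    standard deviation. alpha and beta are the tolerated false-alarm
    and missed-alarm probabilities. Returns (decision, samples_used).
    """
    upper = math.log((1 - beta) / alpha)   # cross this: accept "degraded"
    lower = math.log(beta / (1 - alpha))   # cross this: accept "normal"
    llr = 0.0                              # running log-likelihood ratio
    used = 0
    for x in samples:
        used += 1
        llr += (mu1 - mu0) / sigma ** 2 * (x - (mu0 + mu1) / 2)
        if llr >= upper:
            return "degraded", used
        if llr <= lower:
            return "normal", used
    return "undecided", used


# Hypothetical readings that sit squarely on the degraded mean.
decision, used = sprt_gaussian([1.0] * 20, mu0=0.0, mu1=1.0, sigma=1.0)
```

The appeal for monitoring is that the test reaches a verdict after as few samples as the evidence allows, rather than waiting for a fixed window to fill.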

"Kenny understands intimately the nuances of MSET vs. SPRT. To me that's not as important as understanding how I can get data in there effectively and how I can get data out, so I can let our customer-support engineers know this is a problem they need to be working on," Greaves says.

"I go deep enough to be able to apply the algorithms. I'm not trying to invent them. I'm just trying to find good ways to solve my problem, which is helping customers run computer systems. But I don't necessarily need to understand all the nuances of them. I need to have a pretty good understanding of what their abilities are and how to teach them."

Greaves and his team choose what data to watch based on their field experience.

"It's really a lot of very smart engineers we have who know these applications well, and they help us develop what we call key performance indicators. Sometimes there are a few hundred on each system or each type of application we're working with. These might be things like CPU usage, memory usage, application run queues -- a whole variety of things. Traditionally you'd look at those in isolation. But if you can detect the relationship between the strands of interaction in the system, that's when you can build that personality map up and understand what's going to impact you," Greaves says.

"If you go to any of our competitors in the industry, they will say they can give you three nines [99.9 percent uptime] by reacting very quickly to a problem when it occurs. But the number of nines is going up all the time, and the number of seconds you have to react to a problem is really diminishing."

In short, personality mapping is emerging as a critical tool.

"Right now, it's the technology we use for deep troubleshooting, but our goal is to start moving this into being more of an automated solution and actually embedding this in a lot of the services we provide."


Jon Greaves

Title: Distinguished Engineer and Chief Technologist, Sun Managed Operations.

Background: Twelve years with British Telecom, MCI, and Concert Communications in various capacities. CTO of the startup SevenSpace, a managed services provider. Joined Sun Microsystems in January 2005, when the company acquired SevenSpace.

Formal Education: Sixth-form vocational studies in his native England. Technician training with British Telecom.

Patents: Seven filed in past year.

Quote: "A fool with a tool is still a fool. Even if you get these really cool algorithms, if you don't know how to apply them, you haven't really taken a step forward."

What Others Say: "Jon has an uncanny ability to look at collections of ostensibly chaotic signals that constitute performance metrics from datacenters and spot anomalies that can signify the onset of problems in the network, or, in other cases, challenges to the security of those networks. That inborn pattern-recognition talent, coupled with Jon's knack for teaching others what his eyes see, and ultimately embodying his heuristics into automated 'control tower' software systems, has led to substantial improvements in the reliability and security of complex e-commerce networks." - Sun Distinguished Engineer Kenny Gross

Life-Changing Event: Broke elbow falling off a commando bridge as a Cub Scout. ("That's when I got my first computer and I just got hooked at that point, which drove me into joining BT and going that route.")

Hobbies: Coaching tennis and watching any kind of motor sports, from Formula One to lawn-mower racing.

Blog: http://blogs.sun.com/jon

Favorite Food: Barbecued brisket.

Last Book Read: Softwar: An Intimate Portrait of Larry Ellison and Oracle, by Matthew Symonds.

Favorite Singer: Dishwalla.

Favorite Movie: Dodgeball (2004), directed by Rawson Marshall Thurber.

Little-Known Fact: "I'm actually a carpet fitter by trade."

Inspiration: "Putting myself in our customers' shoes."

Proudest Moment: The birth of my son.

Biggest Challenge: Having a newborn and starting a new job at the same time. ("An operations-based startup like SevenSpace requires a 24/7 commitment, as does your family when it's that young.")

Perfect Day: "Going after customer problems and finding elegant ways to instrument and monitor them so we can get ahead of the curve, avoid an outage, increase availability."