Projects

BigData

Building introspective, reliable and scalable datacenter applications

Primary Research Category: System Science  



Project Overview

Building and operating Internet-scale applications that can reliably support millions of users is a challenge that requires new approaches to the software development, deployment, and debugging processes. Applications such as Youtube, Yelp, eBay and Google Maps are characterized by massive scale--spanning hundreds to thousands of nodes and supporting millions of users. Innovative applications written by hobbyists can go from supporting hundreds of users to millions of users in less than a day.

Supporting this massive scale requires that applications be deployed on hundreds or thousands of machines. A Google search query touches up to 2,000 servers that must all execute that query and respond in less than a third of a second. These applications are built upon new layers of abstraction, spanning distributed filesystems, RPC-based middleware layers, network protocols, operating system interfaces, memory and storage systems. When a failure occurs, discovering the exact set of machines and resources responsible, as well as the location among those resources where the failure occurred, is a daunting task. There is currently little support for understanding how applications interact with these layers, and impact lower-layer resources like disks and CPU. The result is that developers and operators have little insight into how their applications will behave when deployed at scale.

The aim of this project is to provide developers and operators with tools that:

  • provide visibility into the behavior of deployed applications at scale and their impact on underlying resources such as disk, CPU, and memory.

  • isolate the effects of individual client or user operations throughout the application and across the datacenter
  • operate during the application's execution, without requiring it to be taken down or run in a 'debug' mode

Our approach is to modify protocols, instrument applications, and collect fine-grained measurements from the applications and operating system to capture the impact of datacenter applications across the network stack, and across the components making up that application. Our current focus is centered on easing the burden on software developers for adding introspection capabilities to their deployed applications while reducing their impact on performance. This work is a continuation and extension to our previous work on X-Trace, a cross-layer network tracing framework developed at the UC Berkeley RAD Lab. We are extending X-Trace to look not only at the application to network boundary, but also the network to Operating System boundary as well.

As a specific case study of the kind of application we think will benefit from our approach, we consider data-intensive computing systems such as Map/Reduce, Hadoop, and HBase. These applications are increasingly utilized to analyze web traces, large graph data structures (such as the World Wide Web), and volumes of systems data and logs. They incorporate distributed filesystems, job schedulers, and analysis applications written by different users, and thus isolating the effects of individual jobs or user requests is challenging.

Related Publications



Team Members