Lightweight, high-resolution monitoring for troubleshooting production systems

Authors:
Sapan Bhatia;Abhishek Kumar;Marc E. Fiuczynski;Larry Peterson
Affiliations:
Princeton University;Google Inc.;Princeton University;Princeton University
Venue:
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Year:
2008

Citing 22
Cited 18

Software testing techniques (2nd ed.)

Software testing techniques (2nd ed.)
Continuous profiling: where have all the cycles gone?

ACM Transactions on Computer Systems (TOCS)
System support for automatic profiling and optimization

Proceedings of the sixteenth ACM symposium on Operating systems principles
Summary cache: a scalable wide-area Web cache sharing protocol

Proceedings of the ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication
The Basics of Performance-Monitoring Hardware

IEEE Micro
Bug isolation via remote program sampling

PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Gprof: A call graph execution profiler

SIGPLAN '82 Proceedings of the 1982 SIGPLAN symposium on Compiler construction
Building a better NetFlow

Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications
Capturing, indexing, clustering, and retrieving system history

Proceedings of the twentieth ACM symposium on Operating systems principles
CoMon: a mostly-scalable monitoring system for PlanetLab

ACM SIGOPS Operating Systems Review
Using model checking to find serious file system errors

ACM Transactions on Computer Systems (TOCS)
Debugging operating systems with time-traveling virtual machines

ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Dynamic instrumentation of production systems

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Flashback: a lightweight extension for rollback and deterministic replay for software debugging

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Correlating instrumentation data to system states: a building block for automated diagnosis and control

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Flight data recorder: monitoring persistent-state interactions to improve systems management

OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Experiences building PlanetLab

OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Replay debugging for distributed applications

ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
lmbench: portable tools for performance analysis

ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
JIT instrumentation: a novel approach to dynamically instrument operating systems

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Rx: Treating bugs as allergies—a safe method to survive software failures

ACM Transactions on Computer Systems (TOCS)
Triage: diagnosing production run failures at the user's site

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles

mBrace: action-based performance monitoring of multi-tier web applications

Proceedings of the Third Workshop on Dependable Distributed Data Management
Detailed diagnosis in enterprise networks

Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Towards versatile performance models for complex, popular applications

ACM SIGMETRICS Performance Evaluation Review
Practical performance models for complex, popular applications

Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Adaptive system anomaly prediction for large-scale hosting infrastructures

Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Improving software diagnosability via log enhancement

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Otus: resource attribution in data-intensive clusters

Proceedings of the second international workshop on MapReduce and its applications
Fay: extensible distributed tracing from kernels to clusters

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Improving Software Diagnosability via Log Enhancement

ACM Transactions on Computer Systems (TOCS) - Special Issue ASPLOS 2011
Monere: monitoring of service compositions for failure diagnosis

ICSOC'11 Proceedings of the 9th international conference on Service-Oriented Computing
UBL: unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems

Proceedings of the 9th international conference on Autonomic computing
Light-weight black-box failure detection for distributed systems

Proceedings of the 2012 workshop on Management of big data systems
Fay: Extensible Distributed Tracing from Kernels to Clusters

ACM Transactions on Computer Systems (TOCS)
X-ray: automating root-cause diagnosis of performance anomalies in production software

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Fmeter: extracting indexable low-level system signatures by counting kernel function calls

Proceedings of the 13th International Middleware Conference
VScope: middleware for troubleshooting time-sensitive data center applications

Proceedings of the 13th International Middleware Conference
Adaptive monitoring: a framework to adapt passive monitoring using probing

Proceedings of the 8th International Conference on Network and Service Management
Performance troubleshooting in data centers: an annotated bibliography?

ACM SIGOPS Operating Systems Review

Abstract

Production systems are commonly plagued by intermittent problems that are difficult to diagnose. This paper describes a new diagnostic tool, called Chopstix, that continuously collects profiles of low-level OS events (e.g., scheduling, L2 cache misses, CPU utilization, I/O operations, page allocation, locking) at the granularity of executables, procedures, and instructions. Chopstix then reconstructs these events offline for analysis. We have used Chopstix to diagnose several elusive problems in a large-scale production system, thereby reducing these intermittent problems to reproducible bugs that can be debugged using standard techniques. The key to Chopstix is an approximate data collection strategy that incurs very low overhead. An evaluation shows Chopstix requires under 1% of the CPU, under 256 KB of RAM, and under 16 MB of disk space per day to collect a rich set of system-wide data.