Software testing techniques (2nd ed.)
Software testing techniques (2nd ed.)
Continuous profiling: where have all the cycles gone?
ACM Transactions on Computer Systems (TOCS)
System support for automatic profiling and optimization
Proceedings of the sixteenth ACM symposium on Operating systems principles
Summary cache: a scalable wide-area Web cache sharing protocol
Proceedings of the ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication
Bug isolation via remote program sampling
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Gprof: A call graph execution profiler
SIGPLAN '82 Proceedings of the 1982 SIGPLAN symposium on Compiler construction
Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications
Capturing, indexing, clustering, and retrieving system history
Proceedings of the twentieth ACM symposium on Operating systems principles
CoMon: a mostly-scalable monitoring system for PlanetLab
ACM SIGOPS Operating Systems Review
Using model checking to find serious file system errors
ACM Transactions on Computer Systems (TOCS)
Debugging operating systems with time-traveling virtual machines
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Dynamic instrumentation of production systems
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Flashback: a lightweight extension for rollback and deterministic replay for software debugging
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Flight data recorder: monitoring persistent-state interactions to improve systems management
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Experiences building PlanetLab
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Replay debugging for distributed applications
ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
lmbench: portable tools for performance analysis
ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
JIT instrumentation: a novel approach to dynamically instrument operating systems
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Rx: Treating bugs as allergies—a safe method to survive software failures
ACM Transactions on Computer Systems (TOCS)
Triage: diagnosing production run failures at the user's site
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
mBrace: action-based performance monitoring of multi-tier web applications
Proceedings of the Third Workshop on Dependable Distributed Data Management
Detailed diagnosis in enterprise networks
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Towards versatile performance models for complex, popular applications
ACM SIGMETRICS Performance Evaluation Review
Practical performance models for complex, popular applications
Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Adaptive system anomaly prediction for large-scale hosting infrastructures
Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Improving software diagnosability via log enhancement
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Otus: resource attribution in data-intensive clusters
Proceedings of the second international workshop on MapReduce and its applications
Fay: extensible distributed tracing from kernels to clusters
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Improving Software Diagnosability via Log Enhancement
ACM Transactions on Computer Systems (TOCS) - Special Issue APLOS 2011
Monere: monitoring of service compositions for failure diagnosis
ICSOC'11 Proceedings of the 9th international conference on Service-Oriented Computing
Proceedings of the 9th international conference on Autonomic computing
Light-weight black-box failure detection for distributed systems
Proceedings of the 2012 workshop on Management of big data systems
Fay: Extensible Distributed Tracing from Kernels to Clusters
ACM Transactions on Computer Systems (TOCS)
X-ray: automating root-cause diagnosis of performance anomalies in production software
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Fmeter: extracting indexable low-level system signatures by counting kernel function calls
Proceedings of the 13th International Middleware Conference
VScope: middleware for troubleshooting time-sensitive data center applications
Proceedings of the 13th International Middleware Conference
Adaptive monitoring: a framework to adapt passive monitoring using probing
Proceedings of the 8th International Conference on Network and Service Management
Performance troubleshooting in data centers: an annotated bibliography?
ACM SIGOPS Operating Systems Review
Hi-index | 0.00 |
Production systems are commonly plagued by intermittent problems that are difficult to diagnose. This paper describes a new diagnostic tool, called Chopstix, that continuously collects profiles of low-level OS events (e.g., scheduling, L2 cache misses, CPU utilization, I/O operations, page allocation, locking) at the granularity of executables, procedures and instructions. Chopstix then reconstructs these events offline for analysis. We have used Chopstix to diagnose several elusive problems in a largescale production system, thereby reducing these intermittent problems to reproducible bugs that can be debugged using standard techniques. The key to Chopstix is an approximate data collection strategy that incurs very low overhead. An evaluation shows Chopstix requires under 1% of the CPU, under 256KB of RAM, and under 16MB of disk space per day to collect a rich set of system-wide data.