Hypervisor-based fault tolerance
ACM Transactions on Computer Systems (TOCS) - Special issue on operating system principles
RecPlay: a fully integrated practical record/replay system
ACM Transactions on Computer Systems (TOCS)
Apache: The Definitive Guide
Pinpoint: Problem Determination in Large, Dynamic Internet Services
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Efficient on-the-fly data race detection in multithreaded C++ programs
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Managing Web server performance with AutoTune agents
IBM Systems Journal
STRIDER: A Black-box, State-based Approach to Change and Configuration Management and Support
LISA '03 Proceedings of the 17th USENIX conference on System administration
ReVirt: enabling intrusion analysis through virtual-machine logging and replay
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Undo for operators: building an undoable e-mail store
ATEC '03 Proceedings of the annual conference on USENIX Annual Technical Conference
Making the "box" transparent: system call performance as a first-class result
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Dynamic instrumentation of production systems
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Flashback: a lightweight extension for rollback and deterministic replay for software debugging
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Path-based faliure and evolution management
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Understanding and dealing with operator mistakes in internet services
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Configuration debugging as search: finding the needle in the haystack
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Automatic misconfiguration troubleshooting with peerpressure
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Using magpie for request extraction and workload modelling
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Why do internet services fail, and what can be done about it?
USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
Replay debugging for distributed applications
ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
Automatic configuration of internet services
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
AutoBash: improving configuration management with operating system causality analysis
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Decoupling dynamic program analysis from execution in virtual environments
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Boosting the performance of computing systems through adaptive configuration tuning
Proceedings of the 2009 ACM symposium on Applied Computing
PRES: probabilistic replay with execution sketching on multiprocessors
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
ODR: output-deterministic replay for multicore debugging
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
PinPlay: a framework for deterministic replay and reproducible analysis of parallel programs
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Black-box problem diagnosis in parallel file systems
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Lightweight, high-resolution monitoring for troubleshooting production systems
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
JustRunIt: experiment-based management of virtualized data centers
USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Automating configuration troubleshooting with dynamic information flow analysis
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
DoublePlay: parallelizing sequential logging and replay
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
S2E: a platform for in-vivo multi-path analysis of software systems
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Diagnosing performance changes by comparing request flows
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Profiling network performance for multi-tier data center applications
Proceedings of the 8th USENIX conference on Networked systems design and implementation
X-trace: a pervasive network tracing framework
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
An empirical study on configuration errors in commercial and open source systems
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Detecting and surviving data races using complementary schedules
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Web caching on smartphones: ideal vs. reality
Proceedings of the 10th international conference on Mobile systems, applications, and services
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
ACM SIGOPS 24th Symposium on Operating Systems Principles
Do not blame users for misconfigurations
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Limplock: understanding the impact of limpware on scale-out cloud systems
Proceedings of the 4th annual Symposium on Cloud Computing
Robust assessment of changes in cellular networks
Proceedings of the ninth ACM conference on Emerging networking experiments and technologies
Comprehending performance from real-world execution traces: a device-driver case
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
EnCore: exploiting system environment and correlation information for misconfiguration detection
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Performance troubleshooting in data centers: an annotated bibliography?
ACM SIGOPS Operating Systems Review
Hi-index | 0.00 |
Troubleshooting the performance of production software is challenging. Most existing tools, such as profiling, tracing, and logging systems, reveal what events occurred during performance anomalies. However, users of such toolsmust infer why these events occurred; e.g., that their execution was due to a root cause such as a specific input request or configuration setting. Such inference often requires source code and detailed application knowledge that is beyond system administrators and end users. This paper introduces performance summarization, a technique for automatically diagnosing the root causes of performance problems. Performance summarization instruments binaries as applications execute. It first attributes performance costs to each basic block. It then uses dynamic information flow tracking to estimate the likelihood that a block was executed due to each potential root cause. Finally, it summarizes the overall cost of each potential root cause by summing the per-block cost multiplied by the cause-specific likelihood over all basic blocks. Performance summarization can also be performed differentially to explain performance differences between two similar activities. X-ray is a tool that implements performance summarization. Our results show that X-ray accurately diagnoses 17 performance issues in Apache, lighttpd, Postfix, and PostgreSQL, while adding 2.3% average runtime overhead.