Triage: diagnosing production run failures at the user's site

Authors:
Joseph Tucek; Shan Lu; Chengdu Huang; Spiros Xanthos; Yuanyuan Zhou
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL (all authors)
Venue:
Proceedings of the twenty-first ACM SIGOPS Symposium on Operating Systems Principles
Year:
2007

Abstract

Diagnosing production-run failures is a challenging yet important task. Most previous work focuses on offsite diagnosis, i.e., diagnosis at the development site with the programmers present. This is insufficient for production-run failures because: (1) it is difficult to reproduce failures offsite for diagnosis; (2) offsite diagnosis cannot provide timely guidance for recovery or security purposes; (3) it is infeasible to provide a programmer to diagnose every production-run failure; and (4) privacy concerns limit the release of information (e.g., core dumps) to programmers. To address production-run failures, we propose a system, called Triage, that automatically performs onsite software failure diagnosis at the very moment of failure. It provides a detailed diagnosis report, including the failure nature, triggering conditions, related code and variables, the fault propagation chain, and potential fixes. Triage achieves this by leveraging lightweight reexecution support to efficiently capture the failure environment, repeatedly replay the moment of failure, and dynamically analyze an occurring failure using different diagnosis techniques. Triage employs a failure diagnosis protocol that mimics the steps a human takes in debugging. This extensible protocol provides a framework to enable the use of various existing and new diagnosis techniques. We also propose a new failure diagnosis technique, delta analysis, to identify failure-related conditions, code, and variables. We evaluate these ideas in real system experiments with 10 real software failures from 9 open source applications, including four servers. Triage accurately diagnoses the evaluated failures, providing likely root causes and even the fault propagation chain, while keeping normal-run overhead under 5%. Finally, our user study of the diagnosis and repair of real bugs shows that Triage saves time (99.99% confidence), reducing the total time to fix by almost half.
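
The abstract names delta analysis but does not describe its mechanics. As a loose illustration only, the Python sketch below shows one generic way to compare a failing replay against a similar non-failing one and report where they diverge. Everything in it is an assumption made for illustration and not Triage's actual algorithm or interface: the trace representation (a list of code-location labels), the function names `trace_delta` and `failure_only_blocks`, and the sample traces are all hypothetical.

```python
from difflib import SequenceMatcher


def trace_delta(passing, failing):
    """List the spans where a failing trace diverges from a passing one.

    Both arguments are sequences of hashable labels (here, hypothetical
    code-location names). This is a plain sequence diff, used as a stand-in
    for the idea of comparing two executions.
    """
    matcher = SequenceMatcher(a=passing, b=failing, autojunk=False)
    deltas = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            deltas.append(
                {"op": tag, "passing": passing[i1:i2], "failing": failing[j1:j2]}
            )
    return deltas


def failure_only_blocks(passing, failing):
    """Return labels that appear in the failing trace but never in the passing one."""
    return sorted(set(failing) - set(passing))


# Hypothetical traces; the labels are invented for this example.
passing_run = ["parse_header", "check_length", "copy_body", "send_reply"]
failing_run = ["parse_header", "copy_body", "overflow_handler"]

for delta in trace_delta(passing_run, failing_run):
    print(delta["op"], delta["passing"], "->", delta["failing"])
print("only in failing run:", failure_only_blocks(passing_run, failing_run))
```

On these sample traces the diff reports that the failing run skipped `check_length` and ended in `overflow_handler` instead of `send_reply`, which is the kind of failure-related code difference the abstract says delta analysis is meant to surface.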