An empirical study of the reliability of UNIX utilities
Communications of the ACM
Improving the accuracy of data race detection
PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
Eraser: a dynamic data race detector for multithreaded programs
ACM Transactions on Computer Systems (TOCS)
Free transactions with Rio Vista
Proceedings of the sixteenth ACM symposium on Operating systems principles
Programmers use slices when debugging
Communications of the ACM
CCured: type-safe retrofitting of legacy code
POPL '02 Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Isolating cause-effect chains from computer programs
Proceedings of the 10th ACM SIGSOFT symposium on Foundations of software engineering
An Execution-Backtracking Approach to Debugging
IEEE Software
Pinpoint: Problem Determination in Large, Dynamic Internet Services
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Precise dynamic slicing algorithms
Proceedings of the 25th International Conference on Software Engineering
Bug isolation via remote program sampling
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
A "flight data recorder" for enabling full-system multiprocessor deterministic replay
Proceedings of the 30th annual international symposium on Computer architecture
Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Honeycomb: creating intrusion detection signatures using honeypots
ACM SIGCOMM Computer Communication Review
PeerPressure for automatic troubleshooting
Proceedings of the joint international conference on Measurement and modeling of computer systems
Low-overhead memory leak detection using adaptive statistical profiling
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
PSE: explaining program failures via postmortem static analysis
Proceedings of the 12th ACM SIGSOFT twelfth international symposium on Foundations of software engineering
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
The design and implementation of Zap: a system for migrating computing environments
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging
Proceedings of the 32nd annual international symposium on Computer Architecture
Efficient, transparent, and comprehensive runtime code manipulation
Efficient, transparent, and comprehensive runtime code manipulation
Vigilante: end-to-end containment of internet worms
Proceedings of the twentieth ACM symposium on Operating systems principles
Rx: treating bugs as allergies---a safe method to survive software failures
Proceedings of the twentieth ACM symposium on Operating systems principles
Towards Automatic Generation of Vulnerability-Based Signatures
SP '06 Proceedings of the 2006 IEEE Symposium on Security and Privacy
MisleadingWorm Signature Generators Using Deliberate Noise Injection
SP '06 Proceedings of the 2006 IEEE Symposium on Security and Privacy
DieHard: probabilistic memory safety for unsafe languages
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
HDD: hierarchical delta debugging
Proceedings of the 28th international conference on Software engineering
Debugging operating systems with time-traveling virtual machines
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Building a reactive immune system for software services
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Flashback: a lightweight extension for rollback and deterministic replay for software debugging
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Enhancing server availability and security through failure-oblivious computing
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Autograph: toward automated, distributed worm signature detection
SSYM'04 Proceedings of the 13th conference on USENIX Security Symposium - Volume 13
Flight data recorder: monitoring persistent-state interactions to improve systems management
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Replay debugging for distributed applications
ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
Sweeper: a lightweight end-to-end system for defending against fast worms
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Language-based information-flow security
IEEE Journal on Selected Areas in Communications
Better bug reporting with better privacy
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Switchblade: enforcing dynamic personalized system call models
Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Flexible Hardware Acceleration for Instruction-Grain Program Monitoring
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Diagnosing distributed systems with self-propelled instrumentation
Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware
Kendo: efficient deterministic multithreading in software
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
First-aid: surviving and preventing memory management bugs during production runs
Proceedings of the 4th ACM European conference on Computer systems
Transparent checkpoints of closed distributed systems in Emulab
Proceedings of the 4th ACM European conference on Computer systems
Reference-driven performance anomaly identification
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Penumbra: automatically identifying failure-relevant inputs using dynamic tainting
Proceedings of the eighteenth international symposium on Software testing and analysis
Debugging in the (very) large: ten years of implementation and experience
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
PRES: probabilistic replay with execution sketching on multiprocessors
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Multi-stage replay with crosscut
Proceedings of the 6th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Respec: efficient online multiprocessor replayvia speculation and external determinism
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
SherLog: error diagnosis by connecting clues from run-time logs
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
PinPlay: a framework for deterministic replay and reproducible analysis of parallel programs
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
An empirical study of reported bugs in server software with implications for automated bug diagnosis
Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1
Adaptive system anomaly prediction for large-scale hosting infrastructures
Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Lightweight, high-resolution monitoring for troubleshooting production systems
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
DoublePlay: parallelizing sequential logging and replay
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Dependence-based multi-level tracing and replay for wireless sensor networks debugging
Proceedings of the 2011 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Diagnosing performance changes by comparing request flows
Proceedings of the 8th USENIX conference on Networked systems design and implementation
ASDF: an automated, online framework for diagnosing performance problems
Architecting dependable systems VII
Static extraction of program configuration options
Proceedings of the 33rd International Conference on Software Engineering
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Record and transplay: partial checkpointing for replay debugging across heterogeneous systems
Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Record and transplay: partial checkpointing for replay debugging across heterogeneous systems
ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Bug localization in test-driven development
Advances in Software Engineering
Bootstrapping energy debugging on smartphones: a first look at energy bugs in mobile devices
Proceedings of the 10th ACM Workshop on Hot Topics in Networks
DoublePlay: Parallelizing Sequential Logging and Replay
ACM Transactions on Computer Systems (TOCS) - Special Issue APLOS 2011
Improving Software Diagnosability via Log Enhancement
ACM Transactions on Computer Systems (TOCS) - Special Issue APLOS 2011
Can deterministic replay be an enabling tool for mobile computing?
Proceedings of the 12th Workshop on Mobile Computing Systems and Applications
Precomputing possible configuration error diagnoses
ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering
Chimera: hybrid program analysis for determinism
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Stride: search-based deterministic replay in polynomial time via bounded linkage
Proceedings of the 34th International Conference on Software Engineering
A Recovery-Oriented Approach for Software Fault Diagnosis in Complex Critical Systems
International Journal of Adaptive, Resilient and Autonomic Systems
Fmeter: extracting indexable low-level system signatures by counting kernel function calls
Proceedings of the 13th International Middleware Conference
Using likely invariants for automated software fault localization
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Composing OS extensions safely and efficiently with Bascule
Proceedings of the 8th ACM European Conference on Computer Systems
Leveraging the short-term memory of hardware to diagnose production-run software failures
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Global property violation detection and diagnosis for wireless sensor networks
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Hi-index | 0.00 |
Diagnosing production run failures is a challenging yet importanttask. Most previous work focuses on offsite diagnosis, i.e.development site diagnosis with the programmers present. This is insufficient for production-run failures as: (1) it is difficult to reproduce failures offsite for diagnosis; (2) offsite diagnosis cannot provide timely guidance for recovery or security purposes; (3)it is infeasible to provide a programmer to diagnose every production run failure; and (4) privacy concerns limit the release of information(e.g. coredumps) to programmers. To address production-run failures, we propose a system, called Triage, that automatically performs onsite software failure diagnosis at the very moment of failure. It provides a detailed diagnosis report, including the failure nature, triggering conditions, related code and variables, the fault propagation chain, and potential fixes. Triage achieves this by leveraging lightweight reexecution support to efficiently capture the failure environment and repeatedly replay the moment of failure, and dynamically--using different diagnosis techniques--analyze an occurring failure. Triage employs afailure diagnosis protocol that mimics the steps a human takes in debugging. This extensible protocol provides a framework to enable the use of various existing and new diagnosis techniques. We also propose a new failure diagnosis technique, delta analysis, to identify failure related conditions, code, and variables. We evaluate these ideas in real system experiments with 10 real software failures from 9 open source applications including four servers. Triage accurately diagnoses the evaluated failures, providing likely root causes and even the fault propagation chain, while keeping normal-run overhead to under 5%. Finally, our user study of the diagnosis and repair of real bugs shows that Triagesaves time (99.99% confidence), reducing the total time to fix by almost half.