Be conservative: enhancing failure diagnosis with proactive logging

Authors:
Ding Yuan;Soyeon Park;Peng Huang;Yang Liu;Michael M. Lee;Xiaoming Tang;Yuanyuan Zhou;Stefan Savage
Affiliations:
University of Illinois at Urbana-Champaign and University of California, San Diego;University of Illinois at Urbana-Champaign and University of California, San Diego;University of California, San Diego;University of California, San Diego;University of California, San Diego;University of California, San Diego;University of California, San Diego;University of California, San Diego
Venue:
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Year:
2012

Abstract

When systems fail in the field, logged error or warning messages are frequently the only evidence available for assessing and diagnosing the underlying cause. Consequently, the efficacy of such logging (how often and how well error causes can be determined via postmortem log messages) is a matter of significant practical importance. However, there is little empirical data about how well existing logging practices work and how they can be improved. We describe a comprehensive study characterizing the efficacy of logging practices across five large and widely used software systems. Across 250 randomly sampled reported failures, we first identify that more than half of the failures could not be diagnosed well using existing log data. Surprisingly, we find that the majority of these unreported failures are manifested via a common set of generic error patterns (e.g., system call return errors) that, if logged, can significantly ease the diagnosis of these unreported failure cases. We further mechanize this knowledge in a tool called Errlog that proactively adds appropriate logging statements into source code while adding only 1.4% performance overhead. A controlled user study suggests that Errlog can reduce diagnosis time by 60.7%.
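
To make the "system call return errors" pattern concrete, the following is a minimal sketch of what conservative logging at such a site could look like in C. It is an illustration only, not Errlog's actual output or implementation: the helper `log_syscall_error`, the macro `LOG_SYSCALL_ERROR`, and the function `read_config` are hypothetical names introduced here, and the message format is an assumption.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not taken from Errlog): report a failed system call
 * with its source location, the call name, and the errno description.
 * errno is saved and restored because fprintf may modify it. */
static void log_syscall_error(const char *file, int line, const char *call)
{
    int saved = errno;
    fprintf(stderr, "%s:%d: %s failed: %s (errno=%d)\n",
            file, line, call, strerror(saved), saved);
    errno = saved;
}

#define LOG_SYSCALL_ERROR(call) log_syscall_error(__FILE__, __LINE__, (call))

/* Hypothetical example function: read up to cap-1 bytes of `path` into
 * `buf` and NUL-terminate it. `cap` must be at least 1.
 * Returns the number of bytes read, or -1 on failure.
 * Every system call whose return value signals an error is logged. */
static long read_config(const char *path, char *buf, size_t cap)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        LOG_SYSCALL_ERROR("open");
        return -1;
    }

    ssize_t n = read(fd, buf, cap - 1);
    if (n < 0) {
        LOG_SYSCALL_ERROR("read");
        close(fd);
        return -1;
    }
    buf[n] = '\0';

    if (close(fd) < 0) {
        LOG_SYSCALL_ERROR("close");
        return -1;
    }
    return (long)n;
}
```

With this pattern, a call such as `read_config("/some/missing/file", buf, sizeof buf)` returns -1 and leaves a log line naming the failing call and the errno text, which is the kind of evidence the abstract argues is often missing in the field.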