Quickly detecting relevant program invariants
Proceedings of the 22nd international conference on Software engineering
Bugs as deviant behavior: a general approach to inferring errors in systems code
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Bug isolation via remote program sampling
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Low-overhead memory leak detection using adaptive statistical profiling
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
ReVirt: enabling intrusion analysis through virtual-machine logging and replay
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Bell: bit-encoding online memory leak detection
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Using model checking to find serious file system errors
ACM Transactions on Computer Systems (TOCS)
An overview of the saturn project
PASTE '07 Proceedings of the 7th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering
Pip: detecting the unexpected in distributed systems
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Tracking bad apples: reporting the origin of null and undefined value errors
Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications
Better bug reporting with better privacy
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
EIO: error handling is occasionally correct
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Error propagation analysis for file systems
Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Debugging in the (very) large: ten years of implementation and experience
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Detecting large-scale system problems by mining console logs
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
ODR: output-deterministic replay for multicore debugging
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
SherLog: error diagnosis by connecting clues from run-time logs
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
An empirical study of reported bugs in server software with implications for automated bug diagnosis
Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1
KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Dependable computing: concepts, limits, challenges
FTCS'95 Proceedings of the Twenty-Fifth international conference on Fault-tolerant computing
Low-overhead bug fingerprinting for fast debugging
RV'10 Proceedings of the First international conference on Runtime verification
DoublePlay: parallelizing sequential logging and replay
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
S2E: a platform for in-vivo multi-path analysis of software systems
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
FATE and DESTINI: a framework for cloud recovery testing
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Record and transplay: partial checkpointing for replay debugging across heterogeneous systems
Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
G2: a graph processing system for diagnosing distributed systems
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering
Efficient Testing of Recovery Code Using Fault Injection
ACM Transactions on Computer Systems (TOCS)
Improving Software Diagnosability via Log Enhancement
ACM Transactions on Computer Systems (TOCS) - Special Issue APLOS 2011
Characterizing logging practices in open-source software
Proceedings of the 34th International Conference on Software Engineering
Report on the international symposium on high confidence software (ISHCS 2011/2012)
ACM SIGSOFT Software Engineering Notes
Leveraging the short-term memory of hardware to diagnose production-run software failures
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Challenges to error diagnosis in hadoop ecosystems
LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Hi-index | 0.00 |
When systems fail in the field, logged error or warning messages are frequently the only evidence available for assessing and diagnosing the underlying cause. Consequently, the efficacy of such logging--how often and how well error causes can be determined via postmortem log messages--is a matter of significant practical importance. However, there is little empirical data about how well existing logging practices work and how they can yet be improved. We describe a comprehensive study characterizing the efficacy of logging practices across five large and widely used software systems. Across 250 randomly sampled reported failures, we first identify that more than half of the failures could not be diagnosed well using existing log data. Surprisingly, we find that majority of these unreported failures are manifested via a common set of generic error patterns (e.g., system call return errors) that, if logged, can significantly ease the diagnosis of these unreported failure cases. We further mechanize this knowledge in a tool called Errlog, that proactively adds appropriate logging statements into source code while adding only 1.4% performance overhead. A controlled user study suggests that Errlog can reduce diagnosis time by 60.7%.