WuKong: automatically detecting and localizing bugs that manifest at large system scales

Authors:
Bowen Zhou;Jonathan Too;Milind Kulkarni;Saurabh Bagchi
Affiliations:
Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA
Venue:
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Year:
2013


Abstract

A key challenge in developing large-scale applications is finding bugs that are latent at the small scales of testing but manifest when the application is deployed at a large scale. Here, "large scale" has a dual meaning: it can refer to a large number of executing processes, to applications ingesting large amounts of input data, or to both. Traditional statistical techniques fail to detect or diagnose such bugs because no error-free run is available at the large deployment scale for training. Prior work used scaling models to detect anomalous behavior at large scales without training on correct behavior at that scale. However, that work cannot localize bugs automatically, i.e., it cannot pinpoint the region of code responsible for the error. In this paper, we resolve that shortcoming by making the following three contributions: (i) we develop an automatic diagnosis technique based on feature reconstruction; (ii) we design a heuristic to effectively prune the large feature space; and (iii) we demonstrate that our system scales well in terms of both accuracy and overhead. We validate our design through a large-scale fault-injection study and two case studies of real-world bugs, finding that our system can effectively localize bugs in 92.5% of the cases, dramatically easing the challenge of finding bugs in large-scale programs.
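The general idea of localizing a bug with a scaling model can be illustrated with a small sketch. This is not the paper's algorithm: it is a simplified stand-in that assumes each feature (for example, a per-region execution count) follows a power law in the scale, fits that law on small-scale runs only, extrapolates to the large scale, and ranks features by how far the observed value departs from the extrapolated one. The feature names, the numbers, and the helper functions `fit_log_log`, `predict`, and `rank_features` are all hypothetical.

```python
import math


def fit_log_log(scales, values):
    """Least-squares fit of log(value) = a + b * log(scale)."""
    xs = [math.log(s) for s in scales]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(model, scale):
    """Extrapolate the fitted power law to an unseen scale."""
    a, b = model
    return math.exp(a + b * math.log(scale))


def rank_features(training, observed, scale):
    """Rank features by relative error between observed and extrapolated values.

    training: {feature: [(scale, value), ...]} from small, bug-free runs.
    observed: {feature: value} from the single large-scale run.
    """
    errors = {}
    for name, runs in training.items():
        model = fit_log_log([s for s, _ in runs], [v for _, v in runs])
        expected = predict(model, scale)
        errors[name] = abs(observed[name] - expected) / expected
    return sorted(errors.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-region counts measured at 4, 8, and 16 processes.
training = {
    "send_loop": [(4, 40.0), (8, 80.0), (16, 160.0)],      # grows linearly
    "pack_buffer": [(4, 16.0), (8, 64.0), (16, 256.0)],    # grows quadratically
    "halo_exchange": [(4, 10.0), (8, 10.0), (16, 10.0)],   # constant
}

# Hypothetical observation at 64 processes; pack_buffer falls far short of
# the extrapolated trend, so it is ranked as the most suspicious region.
observed = {"send_loop": 640.0, "pack_buffer": 900.0, "halo_exchange": 10.5}

ranking = rank_features(training, observed, scale=64)
for name, err in ranking:
    print(f"{name}: relative error {err:.2f}")
```

The key property the sketch shares with the approach described in the abstract is that no correct run at the large scale is needed: the expected behavior at 64 processes is inferred entirely from runs at smaller scales.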