Vrisha: using scaling properties of parallel programs for bug detection and localization

Authors:
Bowen Zhou;Milind Kulkarni;Saurabh Bagchi
Affiliations:
Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA
Venue:
Proceedings of the 20th international symposium on High performance distributed computing
Year:
2011

Citing 16
Cited 3

NAS parallel benchmark results

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
A framework for performance modeling and prediction

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Kernel independent component analysis

The Journal of Machine Learning Research
Kernel Methods for Pattern Analysis

Kernel Methods for Pattern Analysis
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Problem diagnosis in large-scale computing environments

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Adaptive correctness monitoring for wireless sensor networks using hierarchical distributed run-time invariant checking

ACM Transactions on Autonomous and Adaptive Systems (TAAS)
DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Exploring event correlation for failure prediction in coalitions of clusters

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Lessons learned at 208K: towards debugging millions of cores

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning

ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Scalable temporal order analysis for large scale debugging

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Black-box problem diagnosis in parallel file systems

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
A case for machine learning to optimize multicore performance

HotPar'09 Proceedings of the First USENIX conference on Hot topics in parallelism
FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
ScalaExtrap: trace-based communication extrapolation for spmd programs

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming

ABHRANTA: locating bugs that manifest at large system scales

HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
WuKong: effective diagnosis of bugs at large system scales

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
WuKong: automatically detecting and localizing bugs that manifest at large system scales

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Detecting and isolating bugs that arise in parallel programs is a tedious and a challenging task. An especially subtle class of bugs are those that are scale-dependent: while small-scale test cases may not exhibit the bug, the bug arises in large-scale production runs, and can change the result or performance of an application. A popular approach to finding bugs is statistical bug detection, where abnormal behavior is detected through comparison with bug-free behavior. Unfortunately, for scale-dependent bugs, there may not be bug-free runs at large scales and therefore traditional statistical techniques are not viable. In this paper, we propose Vrisha, a statistical approach to detecting and localizing scale-dependent bugs. Vrisha detects bugs in large-scale programs by building models of behavior based on bug-free behavior at small scales. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined properties, whose values are predictably dependent on application scale. We use Vrisha to detect and diagnose two bugs caused by errors in popular MPI libraries and show that our techniques can be implemented with low overhead and low false-positive rates.