WuKong: automatically detecting and localizing bugs that manifest at large system scales

  • Authors: Bowen Zhou; Jonathan Too; Milind Kulkarni; Saurabh Bagchi
  • Affiliations: Purdue University, West Lafayette, IN, USA (all authors)
  • Venue: Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing
  • Year: 2013

Abstract

A key challenge in developing large-scale applications is finding bugs that are latent at the small scales of testing but manifest themselves when the application is deployed at a large scale. Here, we ascribe a dual meaning to "large scale": it could mean a large number of executing processes, applications ingesting large amounts of input data, or both. Traditional statistical techniques fail to detect or diagnose such bugs because no error-free run is available at the large deployment scales for training purposes. Prior work used scaling models to detect anomalous behavior at large scales without training on correct behavior at that scale. However, that work cannot localize bugs automatically, i.e., it cannot pinpoint the region of code responsible for the error. In this paper, we resolve that shortcoming by making the following three contributions: (i) we develop an automatic diagnosis technique based on feature reconstruction; (ii) we design a heuristic to effectively prune the large feature space; and (iii) we demonstrate that our system scales well in terms of both accuracy and overhead. We validate our design through a large-scale fault-injection study and two case studies of real-world bugs, finding that our system can effectively localize bugs in 92.5% of the cases, dramatically easing the challenge of finding bugs in large-scale programs.
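The general idea of diagnosing by reconstruction can be illustrated with a small sketch. The abstract does not describe WuKong's actual features, models, or pruning heuristic, so everything below is an assumption made for illustration: each feature is taken to be a positive per-code-region count, its dependence on scale is modeled with a simple power-law fit learned only from small-scale runs, and the feature names (`send_loop`, `init`, `exchange`) and data are hypothetical. The feature whose observed large-scale value deviates most from its extrapolated value is reported first as the likely bug location.

```python
import math

def fit_log_linear(scales, values):
    """Least-squares fit of log(value) = a + b * log(scale).

    Assumes all scales and values are positive (illustrative choice,
    not taken from the paper).
    """
    xs = [math.log(s) for s in scales]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict(model, scale):
    """Extrapolate the fitted model to a scale never seen in training."""
    a, b = model
    return math.exp(a + b * math.log(scale))

def rank_features(training_runs, observed, scale):
    """Rank features by relative reconstruction error, largest first.

    training_runs: {feature: [(scale, value), ...]} from small, bug-free runs
    observed:      {feature: value} measured in the large-scale run
    """
    errors = {}
    for name, runs in training_runs.items():
        model = fit_log_linear([s for s, _ in runs], [v for _, v in runs])
        expected = predict(model, scale)
        errors[name] = abs(observed[name] - expected) / expected
    return sorted(errors.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical small-scale training data (scales 2..16 processes).
training_runs = {
    "send_loop": [(2, 4), (4, 8), (8, 16), (16, 32)],   # grows linearly
    "init":      [(2, 5), (4, 5), (8, 5), (16, 5)],     # scale-independent
    "exchange":  [(2, 6), (4, 12), (8, 24), (16, 48)],  # grows linearly
}
# Hypothetical observation at 1024 processes; "exchange" misbehaves.
observed = {"send_loop": 2048, "init": 5, "exchange": 30000}

ranking = rank_features(training_runs, observed, 1024)
for name, err in ranking:
    print(f"{name}: relative error {err:.2f}")
```

In this toy setup, no run at 1024 processes is needed for training: the models are fit only on runs at 2 to 16 processes and extrapolated, which mirrors the constraint stated in the abstract that no error-free run exists at the deployment scale. A real system would need far richer features and models than this single-variable fit.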