Production-run software failure diagnosis via hardware performance counters

Authors:
Joy Arulraj;Po-Chun Chang;Guoliang Jin;Shan Lu
Affiliations:
University of Wisconsin, Madison, WI, USA;University of Wisconsin, Madison, WI, USA;University of Wisconsin, Madison, WI, USA;University of Wisconsin, Madison, WI, USA
Venue:
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Year:
2013

Citing 29
Cited 1

Quickly detecting relevant program invariants

Proceedings of the 22nd international conference on Software engineering
Bugs as deviant behavior: a general approach to inferring errors in systems code

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Tracking down software bugs using automatic anomaly detection

Proceedings of the 24th International Conference on Software Engineering
CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs

CC '02 Proceedings of the 11th International Conference on Compiler Construction
Bug isolation via remote program sampling

PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
ReEnact: using thread-level speculation mechanisms to debug data races in multithreaded codes

Proceedings of the 30th annual international symposium on Computer architecture
iWatcher: Efficient Architectural Support for Software Debugging

Proceedings of the 31st annual international symposium on Computer architecture
Scalable statistical bug isolation

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Associating synchronization constraints with data in an object-oriented language

Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
AVIO: detecting atomicity violations via access interleaving invariants

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Have things changed now?: an empirical study of bug characteristics in modern open source software

Proceedings of the 1st workshop on Architectural and system support for improving software dependability
MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
HARD: Hardware-Assisted Lockset-based Race Detection

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Learning from mistakes: a comprehensive study on real world concurrency bug characteristics

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
LiteRace: effective sampling for lightweight data-race detection

Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Lightweight fault-localization using multiple coverage types

ICSE '09 Proceedings of the 31st International Conference on Software Engineering
A case for an interleaving constrained shared-memory multi-processor

Proceedings of the 36th annual international symposium on Computer architecture
Debugging in the (very) large: ten years of implementation and experience

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Finding concurrency bugs with context-aware communication graphs

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Locating cache performance bottlenecks using data profiling

Proceedings of the 5th European conference on Computer systems
Falcon: fault localization in concurrent programs

Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1
Instrumentation and sampling strategies for cooperative concurrency bug isolation

Proceedings of the ACM international conference on Object oriented programming systems languages and applications
ConSeq: detecting concurrency bugs through sequential errors

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
RACEZ: a lightweight and non-invasive race detection tool for production applications

Proceedings of the 33rd International Conference on Software Engineering
Exploiting hardware advances for software testing and debugging (NIER track)

Proceedings of the 33rd International Conference on Software Engineering
Demand-driven software race detection using hardware performance counters

Proceedings of the 38th annual international symposium on Computer architecture
Rapid identification of architectural bottlenecks via precise event counting

Proceedings of the 38th annual international symposium on Computer architecture
How do fixes become bugs?

Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering
Security breaches as PMU deviation: detecting and identifying security attacks using performance counters

Proceedings of the Second Asia-Pacific Workshop on Systems

Leveraging the short-term memory of hardware to diagnose production-run software failures

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Sequential and concurrency bugs are widespread in deployed software. They cause severe failures and huge financial loss during production runs. Tools that diagnose production-run failures with low overhead are needed. The state-of-the-art diagnosis techniques use software instrumentation to sample program properties at run time and use off-line statistical analysis to identify properties most correlated with failures. Although promising, these techniques suffer from high run-time overhead, which is sometimes over 100%, for concurrency-bug failure diagnosis and hence are not suitable for production-run usage. We present PBI, a system that uses existing hardware performance counters to diagnose production-run failures caused by sequential and concurrency bugs with low overhead. PBI is designed based on several key observations. First, a few widely supported performance counter events can reflect a wide variety of common software bugs and can be monitored by hardware with almost no overhead. Second, the counter overflow interrupt supported by existing hardware and operating systems provides a natural and effective mechanism to conduct event sampling at user level. Third, the noise and non-determinism in interrupt delivery complements well with statistical processing. We evaluate PBI using 13 real-world concurrency and sequential bugs from representative open-source server, client, and utility programs, and 10 bugs from a widely used software-testing benchmark. Quantitatively, PBI can effectively diagnose failures caused by these bugs with a small overhead that is never higher than 10%. Qualitatively, PBI does not require any change to software and presents a novel use of existing hardware performance counters.