Leveraging the short-term memory of hardware to diagnose production-run software failures

Authors:
Joy Arulraj;Guoliang Jin;Shan Lu
Affiliations:
University of Wisconsin, Madison, WI, USA;University of Wisconsin, Madison, WI, USA;University of Wisconsin, Madison, WI, USA
Venue:
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Year:
2014

Abstract

Failures caused by software bugs are widespread in production runs, causing severe losses for end users. Unfortunately, diagnosing production-run failures is challenging. Existing work cannot satisfy privacy, run-time overhead, diagnosis capability, and diagnosis latency requirements all at once. This paper designs a low-overhead, low-latency, privacy-preserving production-run failure diagnosis system based on two observations. First, a short-term memory of program execution is often sufficient for failure diagnosis, as many bugs have short propagation distances. Second, maintaining a short-term memory of execution is much cheaper than maintaining a record of the whole execution. Following these observations, we first identify an existing hardware unit, the Last Branch Record (LBR), that records the last few taken branches to help diagnose sequential bugs. We then propose a simple hardware extension, the Last Cache-coherence Record (LCR), to record the last few cache accesses with specified coherence states and hence help diagnose concurrency bugs. Finally, we design LBRA and LCRA to automatically locate failure root causes using LBR and LCR. Our evaluation uses 31 real-world failures, caused by sequential and concurrency bugs, from 18 representative open-source software projects. The results show that with just 16 record entries, LBR and LCR enable our system to automatically locate the root causes of 27 of the 31 failures, with less than 3% run-time overhead. As our system does not rely on sampling,