Leveraging the short-term memory of hardware to diagnose production-run software failures

  • Authors:
  • Joy Arulraj;Guoliang Jin;Shan Lu

  • Affiliations:
  • University of Wisconsin, Madison, WI, USA;University of Wisconsin, Madison, WI, USA;University of Wisconsin, Madison, WI, USA

  • Venue:
  • Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
  • Year:
  • 2014

Quantified Score

Hi-index 0.00

Visualization

Abstract

Failures caused by software bugs are widespread in production runs, causing severe losses for end users. Unfortunately, diagnosing production-run failures is challenging. Existing work cannot satisfy privacy, run-time overhead, diagnosis capability, and diagnosis latency requirements all at once. This paper designs a low overhead, low latency, privacy preserving production-run failure diagnosis system based on two observations. First, short-term memory of program execution is often sufficient for failure diagnosis, as many bugs have short propagation distances. Second, maintaining a short-term memory of execution is much cheaper than maintaining a record of the whole execution. Following these observations, we first identify an existing hardware unit, Last Branch Record (LBR), that records the last few taken branches to help diagnose sequential bugs. We then propose a simple hardware extension, Last Cache-coherence Record (LCR), to record the last few cache accesses with specified coherence states and hence help diagnose concurrency bugs. Finally, we design LBRA and LCRA to automatically locate failure root causes using LBR and LCR. Our evaluation uses 31 real-world sequential and concurrency bug failures from 18 representative open-source software. The results show that with just 16 record entries, LBR and LCR enable our system to automatically locate the root causes for 27 out of 31 failures, with less than 3% run-time overhead. As our system does not rely on sampling,