PRES: probabilistic replay with execution sketching on multiprocessors

  • Authors:
  • Soyeon Park; Yuanyuan Zhou; Weiwei Xiong; Zuoning Yin; Rini Kaushik; Kyu H. Lee; Shan Lu

  • Affiliations:
  • University of California, San Diego, La Jolla, USA; University of California, San Diego, La Jolla, USA; University of Illinois at Urbana-Champaign, Urbana, USA; University of Illinois at Urbana-Champaign, Urbana, USA; University of Illinois at Urbana-Champaign, Urbana, USA; Purdue University, West Lafayette, USA; University of Wisconsin-Madison, Madison, USA

  • Venue:
  • Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles
  • Year:
  • 2009

Abstract

Bug reproduction is critically important for diagnosing a production-run failure. Unfortunately, reproducing a concurrency bug on multi-processors (e.g., multi-core) is challenging. Previous techniques either incur large overhead or require new non-trivial hardware extensions. This paper proposes a novel technique called PRES (probabilistic replay via execution sketching) to help reproduce concurrency bugs on multi-processors. It relaxes the past (perhaps idealistic) objective of "reproducing the bug on the first replay attempt" to significantly lower production-run recording overhead. This is achieved by (1) recording only partial execution information (referred to as "sketches") during the production run, and (2) relying on an intelligent replayer during diagnosis time (when performance is less critical) to systematically explore the unrecorded non-deterministic space and reproduce the bug. With only partial information, our replayer may require more than one coordinated replay run to reproduce a bug. However, after a bug is reproduced once, PRES can reproduce it every time. We implemented PRES along with five different execution sketching mechanisms. We evaluated them with 11 representative applications, including 4 servers, 3 desktop/client applications, and 4 scientific/graphics applications, with 13 real-world concurrency bugs of different types, including atomicity violations, order violations and deadlocks. PRES (with synchronization or system call sketching) significantly lowered the production-run recording overhead of previous approaches (by up to 4416 times), while still reproducing most tested bugs in fewer than 10 replay attempts. Moreover, PRES scaled well with the number of processors; PRES's feedback generation from unsuccessful replays is critical in bug reproduction.
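The record-then-search idea described in the abstract can be illustrated with a toy model. The following Python sketch is not the PRES implementation: the two-thread program, the `run` and `replay` helpers, and the choice to record only the order of "sys" operations are all assumptions made for illustration. It also enumerates candidate interleavings exhaustively instead of using feedback from failed replays, which the abstract identifies as critical in the real system.

```python
from itertools import combinations

# Hypothetical toy program: each thread makes one recorded "sys" call and then
# performs an unsynchronized increment (read, then write) of a shared counter.
PROGRAM = {
    "T1": ["sys", "read", "write"],
    "T2": ["sys", "read", "write"],
}
EXPECTED = 2  # the counter value when no update is lost


def run(schedule):
    """Execute one interleaving; return (sketch, final counter value).

    The sketch keeps only the global order of "sys" operations. The order of
    the shared-memory accesses is deliberately not recorded.
    """
    counter = 0
    local = {}                      # per-thread value seen by "read"
    pc = {t: 0 for t in PROGRAM}    # per-thread program counter
    sketch = []
    for t in schedule:
        op = PROGRAM[t][pc[t]]
        pc[t] += 1
        if op == "sys":
            sketch.append(t)
        elif op == "read":
            local[t] = counter
        else:  # "write"
            counter = local[t] + 1
    return sketch, counter


def interleavings(n1, n2):
    """Yield every interleaving of n1 steps of T1 with n2 steps of T2."""
    total = n1 + n2
    for positions in combinations(range(total), n1):
        chosen = set(positions)
        yield tuple("T1" if i in chosen else "T2" for i in range(total))


def replay(recorded_sketch):
    """Search the unrecorded space for a schedule that reproduces the bug.

    Candidates that contradict the recorded sketch are skipped and not counted
    as replay attempts. In this toy the check is done by running the candidate;
    a real replayer would instead enforce the recorded order during replay.
    """
    attempts = 0
    for candidate in interleavings(3, 3):
        sketch, counter = run(candidate)
        if sketch != recorded_sketch:
            continue
        attempts += 1
        if counter != EXPECTED:     # lost update: the bug is reproduced
            return candidate, attempts
    return None, attempts


# "Production run": an interleaving that happens to lose an update.
PRODUCTION_SCHEDULE = ("T1", "T2", "T1", "T2", "T1", "T2")
recorded_sketch, prod_counter = run(PRODUCTION_SCHEDULE)

# "Diagnosis time": only the sketch is available to the replayer.
found, attempts = replay(recorded_sketch)
print("recorded sketch:", recorded_sketch)
print("bug reproduced after", attempts, "attempt(s) with schedule", found)
```

Once `replay` returns a full schedule, passing that schedule back to `run` yields the same buggy outcome on every execution, which mirrors the abstract's point that a bug reproduced once can then be reproduced every time. The trade-off the toy makes visible is that a coarser sketch costs less to record but leaves more candidate interleavings to search.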