PRES: probabilistic replay with execution sketching on multiprocessors

Authors:
Soyeon Park;Yuanyuan Zhou;Weiwei Xiong;Zuoning Yin;Rini Kaushik;Kyu H. Lee;Shan Lu
Affiliations:
University of California, San Diego, La Jolla, USA;University of California, San Diego, La Jolla, USA;University of Illinois at Urbana Champaign, Urbana, USA;University of Illinois at Urbana Champaign, Urbana, USA;University of Illinois at Urbana Champaign, Urbana, USA;Purdue University, West Lafayette, USA;University of Wisconsin - Madison, Madison, USA
Venue:
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Year:
2009


Abstract

Bug reproduction is critically important for diagnosing a production-run failure. Unfortunately, reproducing a concurrency bug on multi-processors (e.g., multi-core) is challenging. Previous techniques either incur large overhead or require new non-trivial hardware extensions. This paper proposes a novel technique called PRES (probabilistic replay via execution sketching) to help reproduce concurrency bugs on multi-processors. It relaxes the past (perhaps idealistic) objective of "reproducing the bug on the first replay attempt" to significantly lower production-run recording overhead. This is achieved by (1) recording only partial execution information (referred to as "sketches") during the production run, and (2) relying on an intelligent replayer during diagnosis time (when performance is less critical) to systematically explore the unrecorded non-deterministic space and reproduce the bug. With only partial information, our replayer may require more than one coordinated replay run to reproduce a bug. However, after a bug is reproduced once, PRES can reproduce it every time. We implemented PRES along with five different execution sketching mechanisms. We evaluated them with 11 representative applications, including 4 servers, 3 desktop/client applications, and 4 scientific/graphics applications, with 13 real-world concurrency bugs of different types, including atomicity violations, order violations and deadlocks. PRES (with synchronization or system call sketching) significantly lowered the production-run recording overhead of previous approaches (by up to 4416 times), while still reproducing most tested bugs in fewer than 10 replay attempts. Moreover, PRES scaled well with the number of processors; PRES's feedback generation from unsuccessful replays is critical in bug reproduction.
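
The record-then-search idea in the abstract can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the PRES implementation. Two threads each perform a non-atomic increment of a shared counter. The "sketch" is assumed to record only which thread ran first. The replayer enumerates the schedules consistent with that sketch until the lost-update failure appears. All names here (`run`, `failed`, `record_sketch`, `reproduce`) are hypothetical. Plain enumeration stands in for the feedback-driven exploration that the abstract describes.

```python
from itertools import permutations


def run(schedule):
    """Run two threads that each do a non-atomic increment of a shared counter.

    `schedule` is a tuple of thread ids. A thread's first turn reads the
    counter into a private copy; its second turn writes copy + 1 back.
    """
    counter = 0
    private = {}
    for tid in schedule:
        if tid not in private:
            private[tid] = counter          # read step
        else:
            counter = private[tid] + 1      # write step
    return counter


def failed(schedule):
    """The toy failure: an update was lost, so the counter is not 2."""
    return run(schedule) != 2


def record_sketch(schedule):
    """Hypothetical sketch: keep only which thread ran first, not the full order."""
    return schedule[0]


def reproduce(sketch):
    """Try every schedule consistent with the sketch until the failure shows up.

    Returns (schedule, attempts), or (None, attempts) if nothing fails.
    """
    candidates = sorted(set(permutations((0, 0, 1, 1))))
    attempts = 0
    for schedule in candidates:
        if schedule[0] != sketch:
            continue                        # contradicts the recorded sketch
        attempts += 1
        if failed(schedule):
            return schedule, attempts       # keep the full order for later replays
    return None, attempts


production_schedule = (0, 1, 1, 0)          # assumed failing production run
sketch = record_sketch(production_schedule)
found, attempts = reproduce(sketch)
print(found, attempts)
```

In this toy, the schedule that the replayer finds need not match the production schedule exactly; it only has to agree with the sketch and trigger the same failure. Once the full schedule is saved, calling `failed(found)` again reproduces the failure every time, which mirrors the abstract's claim that a bug reproduced once can be reproduced on every later replay.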