Detecting and surviving data races using complementary schedules

Authors:
Kaushik Veeraraghavan;Peter M. Chen;Jason Flinn;Satish Narayanasamy
Affiliations:
University of Michigan;University of Michigan;University of Michigan;University of Michigan
Venue:
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Year:
2011



Abstract

Data races are a common source of errors in multithreaded programs. In this paper, we show how to protect a program from data race errors at runtime by executing multiple replicas of the program with complementary thread schedules. Complementary schedules are a set of replica thread schedules crafted to ensure that replicas diverge only if a data race occurs and to make it very likely that harmful data races cause divergences. Our system, called Frost, uses complementary schedules to cause at least one replica to avoid the order of racing instructions that leads to incorrect program execution for most harmful data races. Frost introduces outcome-based race detection, which detects data races by comparing the state of replicas executing complementary schedules. We show that this method is substantially faster than existing dynamic race detectors for unmanaged code. To help programs survive bugs in production, Frost also diagnoses the data race bug and selects an appropriate recovery strategy, such as choosing a replica that is likely to be correct or executing more replicas to gather additional information. Frost controls the thread schedules of replicas by running all threads of a replica non-preemptively on a single core. To scale the program to multiple cores, Frost runs a third replica in parallel to generate checkpoints of the program's likely future states. These checkpoints let Frost divide program execution into multiple epochs, which it then runs in parallel. We evaluate Frost using 11 real data race bugs in desktop and server applications. Frost both detects and survives all of these data races. Since Frost runs three replicas, its utilization cost is 3x. However, if there are spare cores to absorb this increased utilization, Frost adds only 3-12% overhead to application runtime.
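The idea of outcome-based race detection described in the abstract can be illustrated with a toy model. The sketch below is not Frost's implementation: it assumes each thread is a plain Python function that runs to completion without preemption, models program state as a dictionary, and uses the hypothetical helpers `run_replica` and `detect_divergence`. It runs two replicas with opposite thread orders and reports a divergence when their final states differ. It leaves out synchronization, checkpoints, and epochs, which the real system has to handle.

```python
def run_replica(thread_order, threads, initial_state):
    """Run each thread to completion, one at a time, in the given order.

    This stands in for running all threads of a replica non-preemptively
    on a single core (a simplifying assumption of this sketch).
    """
    state = dict(initial_state)  # each replica gets its own copy of the state
    for name in thread_order:
        threads[name](state)
    return state


def detect_divergence(threads, initial_state):
    """Run two replicas with opposite thread orders and compare outcomes."""
    names = sorted(threads)
    forward = run_replica(names, threads, initial_state)
    reverse = run_replica(names[::-1], threads, initial_state)
    return forward != reverse, forward, reverse


# Two threads that write the same variable with no synchronization.
def racy_writer_a(state):
    state["x"] = 1


def racy_writer_b(state):
    state["x"] = 2


# Two threads that touch disjoint variables, so order cannot matter.
def writer_a(state):
    state["a"] = 1


def writer_b(state):
    state["b"] = 2


racy_diverged, racy_fwd, racy_rev = detect_divergence(
    {"t1": racy_writer_a, "t2": racy_writer_b}, {"x": 0}
)
clean_diverged, clean_fwd, clean_rev = detect_divergence(
    {"t1": writer_a, "t2": writer_b}, {"a": 0, "b": 0}
)

print("racy program diverged:", racy_diverged)    # True
print("clean program diverged:", clean_diverged)  # False
```

In this toy model the racy pair ends with `x == 2` in one replica and `x == 1` in the other, so the comparison flags it, while the disjoint writers produce identical states. Because the sketch has no notion of locks, it would also flag properly synchronized but order-dependent code; the abstract states that Frost's schedules are crafted so that replicas diverge only if a data race occurs.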