Finding missing synchronization in a distributed computation using controlled re-execution

Authors:
Neeraj Mittal;Vijay K. Garg
Affiliations:
Department of Computer Science, The University of Texas at Dallas, Richardson, TX;Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX
Venue:
Distributed Computing
Year:
2004

Abstract

Correct distributed programs are hard to write. Not surprisingly, distributed systems are especially vulnerable to software faults. Testing and debugging are important ways to improve the reliability of distributed systems. A distributed debugger equipped with a mechanism to re-execute the traced computation in a controlled fashion can greatly facilitate the detection and localization of bugs. This approach gives rise to the general problem of predicate control, which takes a computation and a safety property specified on the computation as inputs, and produces as output a controlled computation, with added synchronization, that maintains the given safety property. We devise efficient control algorithms for two classes of useful predicates, namely region predicates and disjunctive predicates. For the former, we prove that the control algorithm is optimal in the sense that it guarantees the maximum possible concurrency in the controlled computation. For the latter, we prove that our control algorithm generates the least number of synchronization dependencies and therefore has optimal message complexity. Furthermore, we provide a necessary and sufficient condition under which it is possible to efficiently compute a minimal controlling synchronization for a general predicate. We also give an algorithm to compute such a synchronization under the condition provided.
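
To make the predicate control problem concrete, the following is a minimal brute-force sketch, not the algorithms of the paper. It assumes a computation is modelled as per-process event counts plus happened-before dependencies, and it checks whether a candidate set of added synchronization dependencies keeps a safety predicate true in every reachable consistent global state while still letting the computation finish. The names `is_consistent`, `reachable_cuts`, `controls`, and `mutex` are hypothetical and chosen for illustration only.

```python
from collections import deque

def is_consistent(cut, deps):
    """cut[p] is the number of events process p has executed.

    A dependency ((p, k), (q, m)) says the k-th event of process p must
    precede the m-th event of process q (events are numbered from 1).
    """
    return all(cut[p] >= k for (p, k), (q, m) in deps if cut[q] >= m)

def reachable_cuts(lengths, deps):
    """Consistent cuts reachable from the initial cut, one event at a time."""
    start = tuple(0 for _ in lengths)
    seen, queue = {start}, deque([start])
    while queue:
        cut = queue.popleft()
        for p, n in enumerate(lengths):
            if cut[p] < n:
                nxt = cut[:p] + (cut[p] + 1,) + cut[p + 1:]
                if nxt not in seen and is_consistent(nxt, deps):
                    seen.add(nxt)
                    queue.append(nxt)
    return seen

def controls(lengths, deps, extra_sync, predicate):
    """True if the added synchronization maintains the predicate
    in every reachable cut and the final cut stays reachable (no deadlock)."""
    cuts = reachable_cuts(lengths, deps + extra_sync)
    return tuple(lengths) in cuts and all(predicate(c) for c in cuts)

# Two processes, two events each: event 1 enters a critical section,
# event 2 leaves it, so a process is "critical" exactly when its count is 1.
lengths = [2, 2]

def mutex(cut):
    # Disjunction of local conditions: some process is outside its critical section.
    return cut[0] != 1 or cut[1] != 1

# Assumed control: process 0 must leave before process 1 may enter.
sync = [((0, 2), (1, 1))]

print(controls(lengths, [], [], mutex))    # False: cut (1, 1) is reachable
print(controls(lengths, [], sync, mutex))  # True: cut (1, 1) is ruled out
```

The checker enumerates every reachable cut, so it is exponential in the number of processes and only suitable for tiny traces. It is meant to illustrate what a controlling synchronization must achieve, not how to compute one efficiently. A cyclic set of added dependencies is rejected because the final cut becomes unreachable.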