Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution

Authors:
Ashis Tarafdar;Vijay K. Garg
Venue:
Proceedings of the 13th International Symposium on Distributed Computing
Year:
1999

Abstract

Concurrent programs often encounter failures, such as races, owing to the presence of synchronization faults (bugs). One existing technique to tolerate synchronization faults is to roll back the program to a previous state and re-execute, in the hope that the failure does not recur. Instead of relying on chance, our approach is to control the re-execution in order to avoid a recurrence of the synchronization failure. The control is achieved by tracing information during an execution and using this information to add synchronizations during the re-execution. The approach gives rise to a general problem, called the off-line predicate control problem, which takes a computation and a property specified on the computation, and outputs a "controlled" computation that maintains the property. We solve the predicate control problem for the mutual exclusion property, which is especially important in synchronization fault tolerance.
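
The idea of adding synchronizations to a traced computation so that mutual exclusion holds can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the algorithm of the paper: the trace is assumed to be reduced to a set of critical sections plus pairs (a, b) meaning "a is exited before b is entered" in the traced computation, and the function name control_mutex and this input format are hypothetical. Under that model, any total order of the critical sections that respects the traced pairs yields a controlled computation, obtained by adding an exit-to-entry synchronization between consecutive sections.

```python
from collections import defaultdict
import heapq


def control_mutex(sections, must_precede):
    """Return added synchronizations (a, b): 'exit of a before entry of b'.

    sections     -- critical-section ids, e.g. ("P1", 1) (hypothetical format)
    must_precede -- traced pairs (a, b): a is exited before b is entered
    Returns None if the traced pairs are cyclic, i.e. no total order of the
    critical sections exists in this simplified model.
    """
    traced = set(must_precede)
    successors = defaultdict(list)
    indegree = {s: 0 for s in sections}
    for a, b in traced:
        successors[a].append(b)
        indegree[b] += 1

    # Topological sort; the heap only makes the chosen order deterministic.
    ready = [s for s in sections if indegree[s] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        s = heapq.heappop(ready)
        order.append(s)
        for t in successors[s]:
            indegree[t] -= 1
            if indegree[t] == 0:
                heapq.heappush(ready, t)

    if len(order) != len(sections):
        return None  # cycle among the traced constraints

    # Chain consecutive sections, skipping orderings the trace already has.
    return [(a, b) for a, b in zip(order, order[1:]) if (a, b) not in traced]


if __name__ == "__main__":
    cs = [("P1", 1), ("P1", 2), ("P2", 1)]
    trace = [(("P1", 1), ("P1", 2))]  # process order on P1
    print(control_mutex(cs, trace))
```

During a controlled re-execution, each returned pair would be enforced by making the entry to the second critical section wait for the exit from the first. The sketch ignores causality that begins or ends inside a critical section, which a full treatment of the problem has to account for.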