Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution

  • Authors:
  • Ashis Tarafdar; Vijay K. Garg

  • Venue:
  • Proceedings of the 13th International Symposium on Distributed Computing
  • Year:
  • 1999

Abstract

Concurrent programs often encounter failures, such as races, owing to the presence of synchronization faults (bugs). One existing technique to tolerate synchronization faults is to roll back the program to a previous state and re-execute, in the hope that the failure does not recur. Instead of relying on chance, our approach is to control the re-execution in order to avoid a recurrence of the synchronization failure. The control is achieved by tracing information during an execution and using this information to add synchronizations during the re-execution. The approach gives rise to a general problem, called the off-line predicate control problem, which takes a computation and a property specified on the computation, and outputs a "controlled" computation that maintains the property. We solve the predicate control problem for the mutual exclusion property, which is especially important in synchronization fault tolerance.
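
To make the idea of adding synchronizations to a traced computation concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the algorithm from the paper: each critical section is treated as a single atomic node, the trace is reduced to a set of "happened before" pairs between sections, and the function name `control_mutex` and the section labels are hypothetical. The sketch picks any total order of the critical sections that respects the traced causality and returns the extra synchronization edges that would force that order during re-execution.

```python
from graphlib import CycleError, TopologicalSorter


def control_mutex(sections, happened_before):
    """Return synchronization edges that serialize the critical sections.

    sections: list of critical-section labels (hypothetical names).
    happened_before: set of (a, b) pairs, meaning the trace shows
        section a completed before section b started.
    Returns a list of (a, b) edges ("a must exit before b enters"),
    or None if the traced causality admits no serial order.
    """
    # Map every section to the set of sections that must precede it.
    sorter = TopologicalSorter({s: set() for s in sections})
    for before, after in happened_before:
        sorter.add(after, before)
    try:
        order = list(sorter.static_order())
    except CycleError:
        # Cyclic constraints: no controlling order exists in this model.
        return None
    # Chain consecutive sections so that no two can overlap.
    return [(order[i], order[i + 1]) for i in range(len(order) - 1)]


# Example: two processes; p1 runs two critical sections in program order.
edges = control_mutex(
    ["p1.cs1", "p2.cs1", "p1.cs2"],
    {("p1.cs1", "p1.cs2")},
)
```

During a controlled re-execution, each returned edge would be enforced by making the second section wait until the first has exited. A realistic trace would record finer-grained events (section entry and exit, message sends and receives), which this sketch deliberately omits.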