Software Fault Tolerance of Distributed Programs Using Computation Slicing

  • Authors:
  • Neeraj Mittal; Vijay K. Garg

  • Venue:
  • ICDCS '03 Proceedings of the 23rd International Conference on Distributed Computing Systems
  • Year:
  • 2003

Abstract

Writing correct distributed programs is hard. In spite of extensive testing and debugging, software faults persist even in commercial-grade software. Many distributed systems, especially those employed in safety-critical environments, should be able to operate properly even in the presence of software faults. Monitoring the execution of a distributed system and, on detecting a fault, initiating the appropriate corrective action is an important way to tolerate such faults. This gives rise to the predicate detection problem, which involves finding a consistent cut of a distributed computation, if one exists, that satisfies the given global predicate. Detecting a predicate in a computation is, however, an NP-complete problem in general. To ameliorate the associated combinatorial explosion, we introduced the notion of a computation slice in our earlier papers [5, 10]. Intuitively, a slice is a concise representation of those consistent cuts that satisfy a certain condition. To detect a predicate, rather than searching the state space of the computation, it is much more efficient to search the state space of the slice. In this paper, we provide efficient algorithms to compute the slice for several classes of predicates. Our experimental results demonstrate that slicing can lead to an exponential improvement over existing techniques in terms of time and space.
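To make the predicate detection problem concrete, here is a minimal brute-force sketch in Python. It assumes a computation encoded by vector clocks, with a cut written as the number of events each process has executed; the names `is_consistent`, `consistent_cuts`, `detect`, `CLOCKS` and `STATES` are hypothetical illustrations, not taken from the paper. It performs the exhaustive state-space search that slicing is meant to avoid, not the authors' slicing algorithm: the number of candidate cuts is the product of (m_i + 1) over all processes, where m_i is the number of events on process i.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical encoding: a cut is (k_0, ..., k_{n-1}), meaning process i
# has executed its first k_i events.
Cut = Tuple[int, ...]
# clocks[i][k - 1] is the vector clock of the k-th event on process i.
Clocks = Sequence[Sequence[Sequence[int]]]


def is_consistent(cut: Cut, clocks: Clocks) -> bool:
    """A cut is consistent iff it is closed under happened-before.
    Vector clocks grow along each process, so it suffices to check that the
    last included event of every process has all its predecessors in the cut."""
    for i, k in enumerate(cut):
        if k == 0:
            continue
        vc = clocks[i][k - 1]
        if any(vc[j] > cut[j] for j in range(len(cut))):
            return False
    return True


def consistent_cuts(clocks: Clocks) -> List[Cut]:
    """Exhaustively scan all prod(m_i + 1) candidate cuts (the exponential step)."""
    ranges = [range(len(events) + 1) for events in clocks]
    return [cut for cut in product(*ranges) if is_consistent(cut, clocks)]


def detect(clocks: Clocks, states, predicate: Callable[[List], bool]) -> Optional[Cut]:
    """Return the first consistent cut (in lexicographic order) whose global
    state satisfies predicate; states[i][k] is P_i's state after k events."""
    for cut in consistent_cuts(clocks):
        if predicate([states[i][k] for i, k in enumerate(cut)]):
            return cut
    return None


# Example: P0 enters and leaves its critical section, then sends m;
# P1 enters its critical section only after receiving m.
CLOCKS = [
    [[1, 0], [2, 0]],                # P0: enter CS, exit CS + send m
    [[2, 1], [2, 2], [2, 3]],        # P1: receive m, enter CS, exit CS
]
STATES = [
    [False, True, False],            # is P0 in its CS after k events?
    [False, False, True, False],     # is P1 in its CS after k events?
]


def both_in_cs(global_state: List[bool]) -> bool:
    """Global predicate for a mutual exclusion violation."""
    return global_state[0] and global_state[1]


print(len(consistent_cuts(CLOCKS)))        # 6 (out of 12 candidate cuts)
print(detect(CLOCKS, STATES, both_in_cs))  # None: mutual exclusion holds
print(is_consistent((1, 2), CLOCKS))       # False
```

The example shows why the search must be restricted to consistent cuts: the cut (1, 2) appears to put both processes in their critical sections, but it includes the receipt of m without its send, so it is not a global state the system could have passed through.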