On modeling and tolerating incorrect software

  • Authors:
  • Anish Arora; Marvin Theimer

  • Affiliations:
  • The Ohio State University, Columbus, OH 43214, USA. E-mail: anish@cis.ohio-state.edu; Microsoft Research, Redmond, WA 98052, USA. E-mail: theimer@microsoft.com

  • Venue:
  • Journal of High Speed Networks - Self-Stabilizing Systems, Part 2
  • Year:
  • 2005

Abstract

Distributed systems have to deal with the following scenarios in practice: bugs in components; incorrect specifications of components and, therefore, incorrect use of components; unanticipated faults due to complex interactions or to a failure to contain the effects of faults in lower-level components; and evolution of components. Extant fault tolerance models deal with such scenarios in only a limited manner. In particular, we point out that state corruption is inevitable in practice and that therefore one must accept it and seek to correct it. The well-known concepts of detectors and correctors can be used to find and repair state corruption. However, these concepts have traditionally been employed to immediately detect and correct errors caused by misbehaving system components. Immediate detection and correction are often too expensive to perform, and hence we consider the implications of running detectors and correctors only intermittently. More specifically, we address issues that must be dealt with when state corruption may persist within a system for a period of time. We show how to both detect and correct state corruption caused by infrequently occurring “transient” errors, even though such corruption can actively spread to other parts of the system. We also show how to eventually detect all state corruption, even in cases where continually recurring errors are constantly introducing new state corruption. Finally, we discuss the minimum set of capabilities needed from a trusted base of software in order to guarantee the correctness of our algorithms.