Highly fault-tolerant parallel computation

  • Authors:
  • D. A. Spielman

  • Affiliations:
  • -

  • Venue:
  • FOCS '96 Proceedings of the 37th Annual Symposium on Foundations of Computer Science
  • Year:
  • 1996

Abstract

We re-introduce the coded model of fault-tolerant computation in which the input and output of a computational device are treated as words in an error-correcting code. A computational device correctly computes a function in the coded model if its input and output, once decoded, are a valid input and output of the function. In the coded model, it is reasonable to hope to simulate all computational devices by devices whose size is greater by a constant factor but which are exponentially reliable even if each of their components can fail with some constant probability. We consider fine-grained parallel computations in which each processor has a constant probability of producing the wrong output at each time step. We show that any parallel computation that runs for time $t$ on $w$ processors can be performed reliably on a faulty machine in the coded model using $w \log^{O(1)} w$ processors and time $t \log^{O(1)} w$. The failure probability of the computation will be at most $t \cdot \exp(-w^{1/4})$. The codes used to communicate with our fault-tolerant machines are generalized Reed-Solomon codes and can thus be encoded and decoded in $O(n \log^{O(1)} n)$ sequential time and are independent of the machine they are used to communicate with. We also show how coded computation can be used to self-correct many linear functions in parallel with arbitrarily small overhead.
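
The correctness criterion of the coded model can be illustrated with a small sketch. It is not the paper's construction: a toy repetition code stands in for the generalized Reed-Solomon codes, and the names `encode`, `decode`, `computes_in_coded_model`, `f_not`, and `faulty_not_device` are hypothetical, chosen only for illustration. The device reads and writes codewords, and it counts as correct on an input if the decoded output equals the function applied to the decoded input, even when one of its components fails.

```python
from typing import Callable, List

Bits = List[int]

def encode(bits: Bits, r: int = 5) -> Bits:
    """Toy repetition code (an assumed stand-in, not Reed-Solomon): repeat each bit r times."""
    return [b for b in bits for _ in range(r)]

def decode(word: Bits, r: int = 5) -> Bits:
    """Decode by majority vote over each block of r symbols."""
    return [int(2 * sum(word[i:i + r]) > r) for i in range(0, len(word), r)]

def computes_in_coded_model(device: Callable[[Bits], Bits],
                            f: Callable[[Bits], Bits],
                            x: Bits, r: int = 5) -> bool:
    """Check the coded-model criterion on one input x:
    decode(device(encode(x))) must equal f(decode(encode(x)))."""
    coded_in = encode(x, r)
    coded_out = device(coded_in)
    return decode(coded_out, r) == f(decode(coded_in, r))

def f_not(bits: Bits) -> Bits:
    """The function being computed: bitwise NOT."""
    return [1 - b for b in bits]

def faulty_not_device(word: Bits) -> Bits:
    """Applies NOT symbol by symbol, but the component at position 0 is faulty."""
    out = [1 - s for s in word]
    out[0] ^= 1  # a single deterministic fault, for illustration only
    return out

ok = computes_in_coded_model(faulty_not_device, f_not, [1, 0, 1])
```

Here `ok` is true because the single corrupted symbol is outvoted within its block when the output is decoded. The repetition code tolerates only a few faults per block and gives none of the processor, time, or failure-probability bounds stated in the abstract.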