Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing

  • Authors:
  • Youngbae Kim;James S. Plank;Jack J. Dongarra

  • Affiliations:
  • -;-;-

  • Venue:
  • HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
  • Year:
  • 1997

Quantified Score

Hi-index 0.00

Visualization

Abstract

Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this approach, fault tolerance is incorporated into the matrix operations, making them resilient to any single process failure with low overhead. In this paper, we present a technique called multiple checkpointing that enables the matrix operations to tolerate a certain set of multiple processor failures by adding multiple checkpointing processors. Results of implementing this technique on a network of workstations show improvement in both the reliability of the computation and the performance of checkpointing.