Software Implemented Fault Tolerance in Hypercube

  • Authors:
  • Dimiter R. Avresky;S. Geoghegan

  • Affiliations:
  • -;-

  • Venue:
  • Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
  • Year:
  • 1999

Abstract

This paper presents Software Implemented Fault Tolerance (SIFT) for hypercubes, implemented as a software layer that is present in each node of the nCube parallel computer. SIFT combines error-detection software at the application level with a fast reconfiguration algorithm that avoids faulty nodes. The Balance Spanning Tree (BST) is used to embed tree-based algorithms into the hypercube topology. The software layer tolerates any single faulty node in the hypercube, and more than 90% of multiple faults can be tolerated without backtracking. The SIFT approach has been successfully implemented for a quadtree data compression algorithm on 64×64 and 128×128 compressible and incompressible data. The experiments were run on 4- and 16-node nCubes. The time overhead (reconfiguration and recomputation time) incurred by an injected fault was estimated experimentally. The coverage factor provided by the error-detection software was estimated by means of nSOFIT for the quadtree data compression algorithm.
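
To make the idea of reconfiguring around a faulty node concrete, the following is a minimal sketch and not the paper's BST construction or reconfiguration algorithm, neither of which is described in the abstract. It assumes nodes are labeled with binary addresses, so that two nodes are neighbors when their addresses differ in one bit, and it builds an ordinary breadth-first spanning tree over the healthy nodes only. The function names `hypercube_neighbors` and `spanning_tree_avoiding` are hypothetical and chosen for illustration.

```python
from collections import deque


def hypercube_neighbors(node, dim):
    """Return the nodes whose address differs from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dim)]


def spanning_tree_avoiding(dim, root, faulty):
    """Build a breadth-first spanning tree over the healthy nodes.

    Illustrative assumption: a plain BFS tree stands in for the paper's BST.
    Returns a dict mapping each reachable healthy node to its parent
    (the root maps to None).
    """
    if root in faulty:
        raise ValueError("root must be a healthy node")
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in hypercube_neighbors(node, dim):
            # Skip faulty nodes and nodes already attached to the tree.
            if nxt not in faulty and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent


# 16-node hypercube (dimension 4) with node 5 marked faulty.
tree = spanning_tree_avoiding(dim=4, root=0, faulty={5})
```

With a single faulty node in a hypercube of dimension 2 or more, the remaining nodes stay connected, so the tree in this sketch still spans every healthy node. Whether the resulting tree is balanced in the paper's sense is not addressed here.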