Incorporating Fault Tolerance with Replication on Very Large Scale Grids

  • Authors:
  • Elankovan Sundararajan;Aaron Harwood;Ramamohanarao Kotagiri

  • Venue:
  • PDCAT '07 Proceedings of the Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies
  • Year:
  • 2007

Abstract

Providing fault tolerance for message-passing parallel applications in a distributed environment is the rule rather than the exception. A node failure can cause the whole computation to stop, and the computation has to be restarted from the beginning if no fault tolerance is available. However, introducing fault tolerance imposes some overhead on the speedup that can be achieved. In this paper, we introduce a new technique called replication with cross-over packets to improve reliability and increase fault tolerance over Very Large Scale Grids (VLSG). This technique has the two-pronged effect of avoiding both a single point of failure and a single link of failure. We incorporate this new technique into the L-BSP model and show the possible speedup of parallel processes. We also derive the achievable speedup for some fundamental parallel algorithms using this technique.