A channel memory based fault tolerance for MPI applications

  • Authors:
  • A. Selikhov; C. Germain

  • Affiliations:
  • Supercomputer Software Department, ICMMG SB RAS, pr. Lavrentieva, Novosibirsk, Russia; Université Paris-Sud, LAL (CNRS), Bâtiment, Orsay Cedex, France

  • Venue:
  • Future Generation Computer Systems - Special issue: Parallel computing technologies
  • Year:
  • 2005

Abstract

Fault-tolerant message-passing environments protect parallel applications against node failures. Very large-scale computing systems, ranging from large clusters to worldwide Global Computing systems, require a high level of fault tolerance to run parallel applications efficiently. The Channel Memory approach provides the infrastructure for scalable tolerance of simultaneous faults. Combined with a specially designed checkpointing system and recovery protocol, this approach has resulted in the MPICH-V architecture. In this paper, we describe CMDE, a stand-alone distributed program system that is based on the MPICH-V architecture and implements an approach to tolerating faults of the Channel Memories themselves.