Fault tolerant file models for MPI-IO parallel file systems

  • Authors:
  • A. Calderón;F. García-Carballeira;Florin Isaila;Rainer Keller;Alexander Schulz

  • Affiliations:
  • Computer Architecture Group, Computer Science Department, Universidad Carlos III de Madrid, Leganés, Madrid, Spain;Computer Architecture Group, Computer Science Department, Universidad Carlos III de Madrid, Leganés, Madrid, Spain;Computer Architecture Group, Computer Science Department, Universidad Carlos III de Madrid, Leganés, Madrid, Spain;High Performance Computing Center Stuttgart, Universität Stuttgart, Stuttgart, Germany;High Performance Computing Center Stuttgart, Universität Stuttgart, Stuttgart, Germany

  • Venue:
  • PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • Year:
  • 2007

Abstract

Parallelism in file systems is obtained by using several independent server nodes, each supporting one or more secondary storage devices. This approach increases the performance and scalability of the system, but a fault in a single node can make the whole system fail. To avoid this problem, data must be stored using some kind of redundancy technique, so that it can be recovered in case of failure. Fault tolerance can be provided in I/O systems by using replication or RAID-based schemes. However, most current systems apply the same fault tolerance technique at the disk or file system level. This paper describes how fault tolerance support can be used by MPI applications based on PVFS version 2 [1], a well-known parallel file system for clusters. This support can be applied to other parallel file systems with many benefits: fault tolerance at the file level, flexible definition of new fault tolerance schemes, and dynamic reconfiguration of the fault tolerance policy.
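To make the RAID-based redundancy mentioned in the abstract concrete, the following is a minimal sketch of XOR parity over one stripe of blocks spread across server nodes. It is an illustration under stated assumptions, not the scheme implemented in the paper: the function names (`xor_blocks`, `stripe_with_parity`, `recover`), the in-memory blocks standing in for servers, and the single-stripe layout are all hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data, n_servers, block_size):
    """Split data into one stripe of n_servers blocks and compute a parity block.

    Hypothetical layout: each block stands in for the part of the file
    held by one server; the parity block would live on an extra server.
    """
    if len(data) > n_servers * block_size:
        raise ValueError("data does not fit in a single stripe")
    padded = data.ljust(n_servers * block_size, b"\0")  # zero-pad the last block
    blocks = [padded[i * block_size:(i + 1) * block_size] for i in range(n_servers)]
    return blocks, xor_blocks(blocks)


def recover(blocks, parity, failed):
    """Rebuild the block of the failed server from the survivors and the parity."""
    survivors = [b for i, b in enumerate(blocks) if i != failed]
    return xor_blocks(survivors + [parity])


# Example: three data servers, one of which (index 1) fails.
blocks, parity = stripe_with_parity(b"parallel file data", n_servers=3, block_size=6)
rebuilt = recover(blocks, parity, failed=1)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block. This tolerates the loss of one server per stripe; replication, the other approach named in the abstract, would instead store full copies of each block on more than one server.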