A case for redundant arrays of inexpensive disks (RAID)
SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
On implementing MPI-IO portably and with high performance
Proceedings of the sixth workshop on I/O in parallel and distributed systems
CCGRID '03 Proceedings of the 3rd International Symposium on Cluster Computing and the Grid
RAID-x: A New Distributed Disk Array for I/O-Centric Cluster Computing
HPDC '00 Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing
A Fault Tolerant MPI-IO Implementation using the Expand Parallel File System
PDP '05 Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-Based Processing
Parallelism in file systems is obtained by using several independent server nodes, each supporting one or more secondary storage devices. This approach increases the performance and scalability of the system, but a fault in a single node can make the whole system fail. To avoid this problem, data must be stored using some kind of redundancy technique, so that it can be recovered in case of failure. Fault tolerance can be provided in I/O systems by using replication or RAID-based schemes. However, most current systems apply the same fault tolerance technique at the disk or file system level. This paper describes how fault tolerance support can be used by MPI applications based on PVFS version 2 [1], a well-known parallel file system for clusters. This support can be applied to other parallel file systems with many benefits: fault tolerance at the file level, flexible definition of new fault tolerance schemes, and dynamic reconfiguration of the fault tolerance policy.