The Journal of Supercomputing
Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Fault tolerant file models for MPI-IO parallel file systems
PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Hi-index | 0.00 |
Parallelism in file systems is obtained by using several independent server nodes supporting one or more secondary storage devices. This approach increases the performance and scalability of the system, but a fault in one single node can stop the whole system. To avoid this problem, data must be stored using some kind of redundant technique, so any data stored in a faulty element can be recovered. Fault tolerance can be provided in I/O systems using replication or RAID based schemes. However, most of the current systems apply the same technique for all files in the system. This paper describes the fault tolerance support provided by Expand, a parallel file system based on standard servers. Expand allows to define different fault-tolerant mechanisms at file level. The evaluation compare the performance of Expand with different configurations with PVFS using the FLASH-I/O benchmark.