A study of application-level recovery methods for transient network faults
ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
Hi-index | 0.00 |
This paper introduces a new method based on parallel failure recovery, for the fault tolerance issue of parallel programs. In case a process fails, other surviving processes will compute the task of the failed one in parallel, so that the overhead for fault tolerance is leveled down. The paper presents the design and implementation of the parallel FFT using the new approach, and works on finding an optimum number of processes that participate in parallel failure recovery. Finally, an experiment is done to show the better performance of the parallel failure recovery over that of checkpointing, and to show the effectiveness of our solution for the best number of processes participating parallel failure recovery.