Fault tolerance in distributed systems
Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution must keep overhead manageable and must not compromise the high-performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but that wide-area replication outperforms checkpoint-recovery for larger-grain problems, which are precisely the problems best suited to the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.