Fault Tolerant Wide-Area Parallel Computing

Authors:
Jon B. Weissman
Venue:
IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing
Year:
2000

Abstract

Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and must not compromise the high-performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but that wide-area replication outperforms checkpoint-recovery for larger-grain problems, precisely the problems most suited to the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.