A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems

Authors:
Adriana Iamnitchi; Ian Foster
Venue:
ICPP '00 Proceedings of the 2000 International Conference on Parallel Processing
Year:
2000

Abstract

The idle computers on a local-area, campus-area, or even wide-area network represent a significant computational resource, one that is, however, also unreliable, heterogeneous, and opportunistic. We describe an algorithm that allows branch-and-bound problems to be solved in such environments. In designing this algorithm, we faced two challenges: (1) scalability, to effectively exploit the variably sized pools of resources available, and (2) fault tolerance, to ensure the reliability of services. We achieve scalability through a fully decentralized algorithm, in which the dynamically available resources are managed through a membership protocol. We guarantee fault tolerance in the sense that the loss of up to all but one resource will not affect the quality of the solution. For propagating information reliably, we use epidemic communication for both the membership protocol and the fault-tolerance mechanism. We have developed a simulation framework that allows us to evaluate design alternatives. Results obtained in this framework suggest that our techniques can execute scalably and reliably.
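
The abstract names epidemic communication as the way information is propagated but gives no protocol details. As a rough illustration only, the following is a minimal sketch of generic push-style gossip, not the authors' mechanism. The function name `gossip_rounds` and its parameters (`fanout`, `failed`, `seed`, `max_rounds`) are hypothetical and chosen for this example. It assumes synchronous rounds, uniformly random peer selection, and crash failures that are fixed before the run starts.

```python
import random


def gossip_rounds(n_nodes, fanout=1, failed=frozenset(), seed=0, max_rounds=1000):
    """Simulate push-style epidemic dissemination starting at node 0.

    Returns the number of rounds until every live node holds the update,
    or None if max_rounds is reached first. All names are illustrative.
    """
    rng = random.Random(seed)
    live = [i for i in range(n_nodes) if i not in failed]
    if 0 not in live:
        raise ValueError("node 0 is the source and must be live")

    informed = {0}
    rounds = 0
    while len(informed) < len(live):
        if rounds >= max_rounds:
            return None
        newly_informed = set()
        for _sender in informed:
            for _ in range(fanout):
                # Assumption: a sender cannot tell which peers have crashed,
                # so it picks a target uniformly among all node ids.
                target = rng.randrange(n_nodes)
                if target not in failed:
                    newly_informed.add(target)
        informed |= newly_informed
        rounds += 1
    return rounds


if __name__ == "__main__":
    print("rounds, no failures:", gossip_rounds(64, seed=1))
    crashed = frozenset(range(1, 64, 2))  # every odd-numbered node has crashed
    print("rounds, half crashed:", gossip_rounds(64, failed=crashed, seed=1))
```

With a fanout of 1 the informed set can at most double per round, so reaching 64 nodes takes at least 6 rounds. Messages sent to crashed nodes are simply lost, which slows dissemination but does not stop it as long as the source stays live. That is the general property that makes gossip attractive in unreliable pools of the kind the abstract describes.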