A fault detection service for wide area distributed computations

Authors:
Paul Stelling (The Aerospace Corporation, El Segundo, CA 90245-4691, USA)
Cheryl DeMatteis (The Aerospace Corporation, El Segundo, CA 90245-4691, USA)
Ian Foster (Mathematics and Computer Science, Argonne National Laboratory, Argonne, IL 60439, USA)
Carl Kesselman (Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, USA)
Craig Lee (The Aerospace Corporation, El Segundo, CA 90245-4691, USA)
Gregor von Laszewski (Mathematics and Computer Science, Argonne National Laboratory, Argonne, IL 60439, USA)
Venue:
Cluster Computing
Year:
1999

Abstract

The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and serving as part of the NetSolve network-enabled numerical solver.
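
The trade-off between timeliness of reporting and false positive rate can be illustrated with a small sketch. The abstract does not describe the service's actual mechanism or interface, so everything below is an assumption made for illustration: a heartbeat-style detector, the hypothetical class name `HeartbeatMonitor`, and the parameters `interval` and `missed_allowed`. A component is suspected once it has been silent for longer than `interval * (missed_allowed + 1)`, and a later heartbeat withdraws the suspicion, which is what makes the detector unreliable in the technical sense.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Toy unreliable failure detector (illustrative only, not the paper's API)."""

    interval: float       # assumed heartbeat period, in seconds
    missed_allowed: int   # heartbeats that may be missed before suspicion
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        """Record that a heartbeat from `component` arrived at time `now`."""
        self.last_seen[component] = now

    def suspects(self, now: float) -> List[str]:
        """Return the components silent for longer than the timeout."""
        timeout = self.interval * (self.missed_allowed + 1)
        return sorted(
            name for name, seen in self.last_seen.items() if now - seen > timeout
        )


# Same heartbeat history, two tolerance settings.
fast = HeartbeatMonitor(interval=1.0, missed_allowed=0)  # timely, more false positives
slow = HeartbeatMonitor(interval=1.0, missed_allowed=2)  # slower, fewer false positives
for monitor in (fast, slow):
    monitor.heartbeat("worker", now=2.0)

fast_report = fast.suspects(now=3.5)  # 1.5 s of silence exceeds the 1.0 s timeout
slow_report = slow.suspects(now=3.5)  # 1.5 s of silence is within the 3.0 s timeout

# A late heartbeat withdraws the earlier suspicion.
fast.heartbeat("worker", now=3.6)
fast_after = fast.suspects(now=3.7)
```

With a low tolerance the detector reports the silent component sooner but would also flag one that is merely delayed; raising `missed_allowed` delays the report in exchange for fewer false alarms.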