A Fault Detection Service for Wide Area Distributed Computations

Authors:
P. Stelling;C. Lee;I. Foster;G. von Laszewski;C. Kesselman
Venue:
HPDC '98 Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing
Year:
1998

Cited 46

A network performance tool for grid environments

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
CoG kits: a bridge between commodity distributed computing and high-performance grids

Proceedings of the ACM 2000 conference on Java Grande
Applying NetSolve's Network-Enabled Server

IEEE Computational Science & Engineering
Automatic Reincarnation of Deceased Plug-Ins in the HARNESS Metacomputing System

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Fault Tolerant Wide-Area Parallel Computing

IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing
SNAP: A Protocol for Negotiating Service Level Agreements and Coordinating Resource Management in Distributed Systems

JSSPP '02 Revised Papers from the 8th International Workshop on Job Scheduling Strategies for Parallel Processing
An Infrastructure for Monitoring and Management in Computational Grids

LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems

ICPP '00 Proceedings of the 2000 International Conference on Parallel Processing
Faults in Grids: Why are they so bad and What can be done about it?

GRID '03 Proceedings of the 4th International Workshop on Grid Computing
Computing the performability of layered distributed systems with a management architecture

WOSP '04 Proceedings of the 4th international workshop on Software and performance
Analyzing the effectiveness of fault-management architectures in layered distributed systems

Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
Real-Time Strategy and Practice in Service Grid

COMPSAC '04 Proceedings of the 28th Annual International Computer Software and Applications Conference - Volume 01
A taxonomy of grid monitoring systems

Future Generation Computer Systems
Distributing MCell Simulations on the Grid

International Journal of High Performance Computing Applications
Fault-tolerant grid resource management infrastructure

Neural, Parallel & Scientific Computations - Special issue: Grid computing
A resource management and fault tolerance services in grid computing

Journal of Parallel and Distributed Computing - Special issue: Design and performance of networks for super-, cluster-, and grid-computing: Part II
And away we go: understanding the complexity of launching complex HPC applications

Proceedings of the second international workshop on Software engineering for high performance computing system applications
A wide-area Distribution Network for free software

ACM Transactions on Internet Technology (TOIT)
A health-check model for autonomic systems based on a pulse monitor

The Knowledge Engineering Review
Worldwide computing: Adaptive middleware and programming technology for dynamic Grid environments

Scientific Programming - Dynamic Grids and Worldwide Computing
Latency and bandwidth-minimizing failure detectors

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
The Internet Operating System: Middleware for Adaptive Distributed Computing

International Journal of High Performance Computing Applications
QoS management in distributed service oriented systems

PDCN'07 Proceedings of the 25th IASTED International Multi-Conference: Parallel and Distributed Computing and Networks
An approach to grid resource selection and fault management based on ECA rules

Future Generation Computer Systems
Goal graph based performance improvement for self-adaptive modules

Proceedings of the 2nd international conference on Ubiquitous information management and communication
Grid Application Fault Diagnosis Using Wrapper Services and Machine Learning

ICSOC '07 Proceedings of the 5th international conference on Service-Oriented Computing
A group membership service for large-scale grids

Proceedings of the 6th international workshop on Middleware for grid computing
Utility-driven proactive management of availability in enterprise-scale information flows

Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware
Design of the notification system for failure detectors

International Journal of High Performance Computing and Networking
An adaptive task-level fault-tolerant approach to Grid

The Journal of Supercomputing
An evaluation of globus and legion software environments

ICCS'03 Proceedings of the 2003 international conference on Computational science: PartII
visPerf: monitoring tool for grid computing

ICCS'03 Proceedings of the 2003 international conference on Computational science: PartIII
A fault tolerance service for QoS in grid computing

ICCS'03 Proceedings of the 2003 international conference on Computational science: PartIII
Extending self-healing in grid environment by pulse monitoring

Multiagent and Grid Systems
Performance evaluation of fault tolerance techniques in grid computing system

Computers and Electrical Engineering
A fault avoidance strategy improving the reliability of the EGI production grid infrastructure

OPODIS'10 Proceedings of the 14th international conference on Principles of distributed systems
A security management scheme for failure detector distributed systems based on self-tuning control theory

Journal of Intelligent Manufacturing
Architecture-based fault tolerance support for grid applications

Proceedings of the joint ACM SIGSOFT conference -- QoSA and ACM SIGSOFT symposium -- ISARCS on Quality of software architectures -- QoSA and architecting critical systems -- ISARCS
Fault-tolerant dynamic job scheduling policy

ICA3PP'05 Proceedings of the 6th international conference on Algorithms and Architectures for Parallel Processing
Robust parallel job scheduling infrastructure for service-oriented grid computing systems

ICCSA'05 Proceedings of the 2005 international conference on Computational Science and Its Applications - Volume Part IV
Replication based fault tolerant job scheduling strategy for economy driven grid

The Journal of Supercomputing
On affirmative adaptive failure detection

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II
A Failure Detection System for Large Scale Distributed Systems

International Journal of Distributed Systems and Technologies
A SLA graph model for data services

Proceedings of the fifth international workshop on Cloud data management

Abstract

The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and as part of the NetSolve network-enabled numerical solver.
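
The trade-off between timeliness of reporting and false positive rate can be illustrated with a minimal timeout-based heartbeat detector. This is only a sketch of the general idea behind unreliable fault detectors, not the service described in the paper: the class `HeartbeatMonitor`, the helper `replay`, the component name `"worker"`, and all timing values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Illustrative timeout-based detector (hypothetical, not the paper's API)."""
    timeout: float  # seconds of silence tolerated before suspecting a component
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from `component`.
        self.last_seen[component] = now

    def suspects(self, now: float) -> set:
        # A component is suspected once it has been silent longer than `timeout`.
        # The suspicion may be wrong: a late heartbeat clears it again.
        return {c for c, t in self.last_seen.items() if now - t > self.timeout}


def replay(timeout, arrivals, check_times):
    """Replay one heartbeat trace and report (time, suspected?) at each check."""
    monitor = HeartbeatMonitor(timeout)
    pending = sorted(arrivals)
    results = []
    for now in sorted(check_times):
        # Deliver every heartbeat that has arrived by this check time.
        while pending and pending[0] <= now:
            monitor.heartbeat("worker", pending.pop(0))
        results.append((now, "worker" in monitor.suspects(now)))
    return results


# Assumed trace: one heartbeat is delayed (2.0 -> 4.5), then the worker
# crashes after its last heartbeat at 5.5.
arrivals = [0.0, 1.0, 2.0, 4.5, 5.5]
checks = [4.0, 5.0, 8.0, 9.0]

short = replay(1.5, arrivals, checks)  # reports the crash by t=8.0, but wrongly suspects at t=4.0
long_ = replay(3.0, arrivals, checks)  # no false suspicion, but reports the crash only at t=9.0
```

With the shorter timeout the detector reports the real crash sooner but also raises a false suspicion during the delayed heartbeat; the longer timeout avoids the false suspicion at the cost of later detection.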