The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failures, while allowing the user to trade off timeliness of reporting against false positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and providing fault detection as part of the NetSolve network-enabled numerical solver.
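The trade-off between timeliness of reporting and false positive rates can be illustrated with a minimal sketch of a timeout-based unreliable detector. This is not the service's actual design or interface: the class `HeartbeatMonitor`, its methods, the component names, and the timeout values are all hypothetical, and timestamps are passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Toy timeout-based failure detector (hypothetical, for illustration only)."""

    timeout: float  # seconds of silence after which a component is suspected
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        """Record that a heartbeat from `component` arrived at time `now`."""
        self.last_seen[component] = now

    def suspected(self, now: float) -> List[str]:
        """Return the components that have been silent for longer than the timeout."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Assumed scenario: "a" is alive, but its next heartbeat is delayed until t=5.0;
# "b" crashes after its heartbeat at t=1.0.
arrivals = [("a", 0.0), ("b", 0.0), ("a", 1.0), ("b", 1.0), ("a", 2.0)]

fast = HeartbeatMonitor(timeout=2.0)  # reports early, may suspect live components
slow = HeartbeatMonitor(timeout=4.0)  # fewer false suspicions, later reports
for name, t in arrivals:
    fast.heartbeat(name, t)
    slow.heartbeat(name, t)

print(fast.suspected(now=4.5))  # both "a" (falsely) and "b" are suspected
print(slow.suspected(now=4.5))  # nobody is suspected yet

# The delayed heartbeat from "a" arrives; an unreliable detector revises its view.
fast.heartbeat("a", 5.0)
slow.heartbeat("a", 5.0)
print(fast.suspected(now=5.5))  # only "b" remains suspected
print(slow.suspected(now=5.5))  # "b" is now suspected, later than with `fast`
```

In this sketch the short timeout reports the crash of "b" sooner, at the price of briefly suspecting the live component "a"; the long timeout avoids the false suspicion but reports the crash later. A single tunable parameter of this kind is one simple way to expose the trade-off to the user.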