The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false-positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and as part of the NetSolve network-enabled numerical solver.
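The trade-off between timely reporting and false positives can be illustrated with a minimal heartbeat-timeout sketch. This is not the service described above: the class `HeartbeatDetector`, its parameters `interval` and `missed_allowed`, and the component name `server-a` are all hypothetical, chosen only to show how a shorter timeout reports failures sooner but may wrongly suspect a component whose heartbeat is merely delayed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatDetector:
    """Illustrative unreliable failure detector (hypothetical, not the paper's API)."""
    interval: float        # assumed seconds between expected heartbeats
    missed_allowed: int    # heartbeats that may be missed before suspecting
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        # Record the arrival time of a heartbeat; a late heartbeat
        # implicitly clears any earlier suspicion of the component.
        self.last_seen[component] = now

    def timeout(self) -> float:
        # Silence longer than this marks a component as suspected.
        return self.interval * (self.missed_allowed + 1)

    def suspected(self, now: float) -> List[str]:
        # Components whose last heartbeat is older than the timeout.
        limit = self.timeout()
        return sorted(c for c, t in self.last_seen.items() if now - t > limit)


# Two detectors watching the same component with different settings.
fast = HeartbeatDetector(interval=1.0, missed_allowed=0)  # timely, more false positives
slow = HeartbeatDetector(interval=1.0, missed_allowed=3)  # slower, fewer false positives

for detector in (fast, slow):
    detector.heartbeat("server-a", now=0.0)

# A heartbeat is delayed: at t=2.5 only the aggressive detector suspects.
early_fast = fast.suspected(now=2.5)
early_slow = slow.suspected(now=2.5)

# The delayed heartbeat arrives at t=3.0, so the suspicion was a false positive.
for detector in (fast, slow):
    detector.heartbeat("server-a", now=3.0)
after_late_heartbeat = fast.suspected(now=3.5)

# The component then really stops: the conservative detector reports it later.
fast_after_crash = fast.suspected(now=5.0)
slow_after_crash = slow.suspected(now=5.0)
slow_eventually = slow.suspected(now=7.5)
```

In this sketch the detector is "unreliable" in the sense that a suspicion can later be withdrawn; raising `missed_allowed` lowers the chance of a false report at the cost of a longer detection delay.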