Fault Management in Distributed Systems: A Policy-Driven Approach

Authors:
Hanan L. Lutfiyya; Michael A. Bauer; Andrew D. Marshall; David K. Stokes
Affiliations:
Department of Computer Science, The University of Western Ontario, London, Canada (all authors); hanan@csd.uwo.ca (Hanan L. Lutfiyya)
Venue:
Journal of Network and Systems Management
Year:
2000

Abstract

Managing the availability and performance of a distributed system involves monitoring the system's behavior, identifying problems, and correcting them. Each of these tasks requires expertise, such as an understanding of the mechanics of the underlying system components. As the size and complexity of these systems increase, and as more distributed applications execute on them, managing availability and performance becomes more difficult. Little research has focused on embedding systems management expertise into a management application for a distributed system. In this paper we describe a rule-based management application for a commercially available distributed computing environment. The application monitors the distributed system, detects performance and availability problems related to system services, and generates corrective actions to resolve them.