Fault Management in Distributed Systems: A Policy-Driven Approach

  • Authors:
  • Hanan L. Lutfiyya; Michael A. Bauer; Andrew D. Marshall; David K. Stokes

  • Affiliations:
  • Department of Computer Science, The University of Western Ontario, London, Canada. hanan@csd.uwo.ca; Department of Computer Science, The University of Western Ontario, London, Canada; Department of Computer Science, The University of Western Ontario, London, Canada; Department of Computer Science, The University of Western Ontario, London, Canada

  • Venue:
  • Journal of Network and Systems Management
  • Year:
  • 2000

Abstract

Managing the availability and performance of a distributed system involves monitoring the behavior of the system, identifying system problems, and correcting those problems. Each of these tasks requires some expertise, such as an understanding of the mechanics of the underlying system components. As the size and complexity of these systems increase, and as the number of distributed applications executing on them grows, managing their availability and performance becomes more difficult. Little research has focused on embedding systems management expertise into a management application for a distributed system. In this paper we describe a rule-based management application for a commercially available distributed computing environment that is capable of monitoring the distributed system, detecting performance and availability problems related to system services, and generating actions to correct them.
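
The monitor, detect, and correct cycle described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual rules, metrics, or actions, so everything below is an assumption made for illustration only: the metric names (heartbeat_age_s, response_ms), the thresholds, and the action names (restart_service, add_replica) are hypothetical and are not taken from the authors' system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A snapshot of monitored values for one service (hypothetical metric names).
Metrics = Dict[str, float]


@dataclass
class Rule:
    """One management rule: a condition over metrics and a corrective action."""
    name: str
    condition: Callable[[Metrics], bool]
    action: str  # hypothetical action identifier, not from the paper


def evaluate(rules: List[Rule], metrics: Metrics) -> List[str]:
    """Return the actions of every rule whose condition holds, in rule order."""
    return [rule.action for rule in rules if rule.condition(metrics)]


# Illustrative rules only; thresholds are assumptions, not the authors' values.
RULES = [
    # Availability: no heartbeat seen for more than 30 seconds.
    Rule("service_down",
         lambda m: m.get("heartbeat_age_s", 0.0) > 30.0,
         "restart_service"),
    # Performance: response time above 500 milliseconds.
    Rule("slow_response",
         lambda m: m.get("response_ms", 0.0) > 500.0,
         "add_replica"),
]

if __name__ == "__main__":
    sample = {"heartbeat_age_s": 45.0, "response_ms": 120.0}
    print(evaluate(RULES, sample))
```

Keeping each rule as data (a condition paired with an action) is one common way to hold management expertise separately from the monitoring code, so that rules can be added or changed without touching the evaluation loop.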