Scalable problem localization for distributed systems: principles and practices

Authors:
Rui Zhang;Bruno C. d. S. Oliveira;Alan Bivens;Steve McKeever
Affiliations:
Oxford University, Oxford, England;Oxford University, Oxford, England;IBM T. J. Watson Research Center, Hawthorne, NY;Oxford University, Oxford, England
Venue:
Proceedings of the 2nd international conference on Scalable information systems
Year:
2007


Algorithms
Abstract

Problem localization is a critical part of system management in modern distributed environments. One key open challenge is for problem localization solutions to scale to systems containing hundreds or even thousands of nodes, whilst remaining fast enough to respond to rapid environment changes and sufficiently cost-effective to avoid overloading any management or application component. This paper addresses the challenge by introducing two scalable frameworks applicable to a wide range of existing problem localization solutions: one based on a summary-driven, narrow-down procedure, the other on decomposing and decentralizing the problem localization process. Both frameworks, at their best, achieve O(log N) problem localization time and O(1) per-node communication load. The contrasting natures of the two frameworks give them complementary strengths that make them suitable for different scenarios in practice. We demonstrate our approaches in simulation settings and two real-world environments and show promising scalability benefits that can make a difference in system management operations.
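The abstract does not spell out the summary-driven, narrow-down procedure, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes each node reports a single latency value, that a group's summary is the maximum latency beneath it, and that nodes are arranged in a balanced aggregation tree. The names Node, build_tree, and localize, the fan-out, and the threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    summary: float                      # assumed summary: worst latency in this subtree
    children: List["Node"] = field(default_factory=list)


def build_tree(latencies: List[float], fanout: int = 2) -> Node:
    """Build a balanced aggregation tree over a non-empty list of leaf latencies."""
    level = [Node(f"node{i}", lat) for i, lat in enumerate(latencies)]
    while len(level) > 1:
        groups = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        # Each parent stores only a constant-size summary of its children.
        level = [
            Node(f"group{i}", max(c.summary for c in g), g)
            for i, g in enumerate(groups)
        ]
    return level[0]


def localize(root: Node, threshold: float) -> Tuple[Optional[str], int]:
    """Descend from the root into the first child whose summary exceeds the threshold.

    Returns the suspected leaf name (or None if the root looks healthy) and the
    number of summaries inspected along the way.
    """
    inspected = 1
    if root.summary <= threshold:
        return None, inspected
    current = root
    while current.children:
        for child in current.children:
            inspected += 1
            if child.summary > threshold:
                current = child
                break
    return current.name, inspected


if __name__ == "__main__":
    latencies = [10.0] * 16
    latencies[11] = 500.0               # inject one slow node
    suspect, inspected = localize(build_tree(latencies), threshold=100.0)
    print(suspect, inspected)
```

With a fixed fan-out, the descent inspects at most fan-out summaries per tree level, so the number of summaries read grows with the tree depth (logarithmic in the number of leaves) rather than with the number of nodes. This sketch follows a single faulty branch; handling several simultaneous problems would require exploring every child whose summary exceeds the threshold.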