Adaptive diagnosis in distributed systems

Authors:
I. Rish;M. Brodie;Sheng Ma;N. Odintsova;A. Beygelzimer;G. Grabarnik;K. Hernandez
Affiliations:
IBM T.J. Watson Res. Center, Hawthorne, NY, USA;-;-;-;-;-;-
Venue:
IEEE Transactions on Neural Networks
Year:
2005

Citing 0
Cited 23

Blind source separation approach to performance diagnosis and dependency discovery

Proceedings of the 7th ACM SIGCOMM conference on Internet measurement
Bayesian Methods for Practical Traitor Tracing

ACNS '07 Proceedings of the 5th international conference on Applied Cryptography and Network Security
Performance Problem Determination Using Combined Dependency Analysis for Reliable System

ATC '08 Proceedings of the 5th international conference on Autonomic and Trusted Computing
Active Diagnosis of High-Level Faults in Distributed Internet Services

APNOMS '08 Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management
Toward autonomic grids: analyzing the job flow with affinity streaming

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Multi-scale Real-Time Grid Monitoring with Job Stream Mining

CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Adaptive traitor tracing with Bayesian networks

IAAI'07 Proceedings of the 19th national conference on Innovative applications of artificial intelligence - Volume 2
Optimal testing of structured knowledge

AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2
A rule-based CBR approach for expert finding and problem diagnosis

Expert Systems with Applications: An International Journal
Probabilistic fault diagnosis for IT services in noisy and dynamic environments

IM'09 Proceedings of the 11th IFIP/IEEE international conference on Symposium on Integrated Network Management
Towards an optimized model of incident ticket correlation

IM'09 Proceedings of the 11th IFIP/IEEE international conference on Symposium on Integrated Network Management
Scalable diagnosis in IP networks using path-based measurement and inference: A learning framework

Journal of Visual Communication and Image Representation
Problem localization for automated system management in ubiquitous computing

EUC'07 Proceedings of the 2007 conference on Emerging direction in embedded and ubiquitous computing
Probabilistic fault diagnosis using adaptive probing

DSOM'07 Proceedings of the Distributed systems: operations and management 18th IFIP/IEEE international conference on Managing virtualization of networks and services
Sparse signal recovery with exponential-family noise

Allerton'09 Proceedings of the 47th annual Allerton conference on Communication, control, and computing
Information theoretic adaptive tracking of epidemics in complex networks

Allerton'09 Proceedings of the 47th annual Allerton conference on Communication, control, and computing
Efficient active probing for fault diagnosis in large scale and noisy networks

INFOCOM'10 Proceedings of the 29th conference on Information communications
Fault diagnosis in IP networks via multicast probing: noisy measurements

Sarnoff'10 Proceedings of the 33rd IEEE conference on Sarnoff
Leveraging many simple statistical models to adaptively monitor software systems

International Journal of High Performance Computing and Networking
A probe prediction approach to overlay network monitoring

Proceedings of the 7th International Conference on Network and Services Management
Distributed Monitoring with Collaborative Prediction

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Efficient probe selection for fault localization using the property of submodularity

International Journal of Communication Systems
Efficient distributed monitoring with active Collaborative Prediction

Future Generation Computer Systems

Abstract

Real-time problem diagnosis in large distributed computer systems and networks is a challenging task that requires fast and accurate inferences from potentially huge data volumes. In this paper, we propose a cost-efficient, adaptive diagnostic technique called active probing. Probes are end-to-end test transactions that collect information about the performance of a distributed system. Active probing combines probabilistic reasoning techniques with an information-theoretic approach, and allows fast online inference about the current system state via active selection of only a small number of the most informative tests. We demonstrate empirically that the active probing scheme greatly reduces both the number of probes (by 60% to 75% in most of our real-life applications) and the time needed to localize the problem, compared with nonadaptive (preplanned) probing schemes. We also provide some theoretical results on the complexity of probe selection and on the effect of "noisy" probes on the accuracy of diagnosis. Finally, we discuss how to model the system's dynamics using dynamic Bayesian networks (DBNs) and an efficient approximate approach called sequential multifault; empirical results demonstrate a clear advantage of such approaches over "static" techniques that do not handle changes in the system.
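
The idea of selecting only the most informative probes can be illustrated with a small sketch. It is not the paper's algorithm: it assumes at most one faulty node, noise-free probes (a probe fails exactly when it traverses the faulty node), and a uniform prior, and it uses a simple greedy rule that picks the probe whose pass/fail outcome has the highest entropy under the current belief. The names `outcome_entropy`, `active_diagnosis`, `run_probe`, and the toy topology are all hypothetical.

```python
import math

def outcome_entropy(belief, probe):
    """Entropy (bits) of a probe's pass/fail outcome under the current belief.

    With noise-free probes this equals the expected information gain.
    """
    p_fail = sum(p for state, p in belief.items() if state in probe)
    return sum(-q * math.log2(q) for q in (p_fail, 1.0 - p_fail) if q > 0.0)

def active_diagnosis(nodes, probes, run_probe, eps=1e-9):
    """Greedy adaptive probing under a single-fault, noise-free assumption.

    nodes:     iterable of node names
    probes:    dict mapping probe name -> set of nodes the probe traverses
    run_probe: callable(probe name) -> True if the probe failed
    Returns the final belief over states and the list of probes actually sent.
    """
    states = list(nodes) + [None]          # None stands for "no fault"
    belief = {s: 1.0 / len(states) for s in states}
    used = []
    while len(belief) > 1:
        best = max(probes, key=lambda name: outcome_entropy(belief, probes[name]))
        if outcome_entropy(belief, probes[best]) < eps:
            break                          # no remaining probe is informative
        failed = run_probe(best)
        used.append(best)
        # Keep only states consistent with the observed outcome, then renormalize.
        belief = {s: p for s, p in belief.items() if (s in probes[best]) == failed}
        total = sum(belief.values())
        belief = {s: p / total for s, p in belief.items()}
    return belief, used

# Toy example: four nodes, three end-to-end probes, node "c" is faulty.
nodes = ["a", "b", "c", "d"]
probes = {"p1": {"a", "b"}, "p2": {"b", "c"}, "p3": {"c", "d"}}
true_fault = "c"
belief, used = active_diagnosis(nodes, probes, lambda name: true_fault in probes[name])
print(belief, used)
```

Because each probe is chosen after the previous outcome is observed, the loop can stop as soon as the belief collapses to one state, which is the intuition behind sending fewer probes than a preplanned set. Handling noisy probes or multiple simultaneous faults would require a full probabilistic update rather than the hard elimination used here.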