Modeling and Tracking of Transaction Flow Dynamics for Fault Detection in Complex Systems

Authors:
Guofei Jiang;Haifeng Chen;Kenji Yoshihira
Affiliations:
IEEE;-;-
Venue:
IEEE Transactions on Dependable and Secure Computing
Year:
2006

Abstract

With the prevalence of Internet services and their increasing complexity, there is a growing need to improve their operational reliability and availability. Although a large amount of monitoring data can be collected from these systems for fault analysis, it is hard to correlate this data effectively across distributed systems and over observation time. In this paper, we analyze the mass characteristics of user requests and propose a novel approach to modeling and tracking transaction flow dynamics for fault detection in complex information systems. We measure the flow intensity at multiple checkpoints inside the system and apply system identification methods to model the transaction flow dynamics between these measurements. Using the learned analytical models, we apply a model-based fault detection and isolation method that tracks the flow dynamics in real time. We also propose an algorithm that automatically searches for and validates the dynamic relationships between randomly selected monitoring points, which gives systems a self-cognition capability for system management. We test our approach on a real system with a list of injected faults, and the experimental results demonstrate the effectiveness of our approach and algorithms.
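
The modeling and tracking steps described in the abstract can be illustrated with a minimal sketch. The abstract names only "system identification methods", so everything concrete below is an assumption made for illustration: the first-order ARX form y[t] = a*y[t-1] + b*x[t], the normalized fitness score used to validate a pair of monitoring points, the alarm threshold (three times the largest training residual), and the synthetic flow-intensity data with a simulated fault. The function names (fit_arx, fitness, simulate, demo) are hypothetical and do not come from the paper.

```python
import math
import random


def fit_arx(x, y):
    """Least-squares fit of the assumed model y[t] = a*y[t-1] + b*x[t]."""
    s_pp = s_pu = s_uu = r_p = r_u = 0.0
    for t in range(1, len(y)):
        p, u, target = y[t - 1], x[t], y[t]
        s_pp += p * p
        s_pu += p * u
        s_uu += u * u
        r_p += p * target
        r_u += u * target
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s_pp * s_uu - s_pu * s_pu
    if abs(det) < 1e-12:
        raise ValueError("regressors are collinear; cannot identify a and b")
    a = (r_p * s_uu - r_u * s_pu) / det
    b = (s_pp * r_u - s_pu * r_p) / det
    return a, b


def prediction_errors(x, y, a, b):
    """One-step-ahead residuals y[t] - (a*y[t-1] + b*x[t]) for t >= 1."""
    return [y[t] - (a * y[t - 1] + b * x[t]) for t in range(1, len(y))]


def fitness(x, y, a, b):
    """Normalized fit: 1 means a perfect model, values near 0 mean no fit."""
    errs = prediction_errors(x, y, a, b)
    observed = y[1:]
    mean_y = sum(observed) / len(observed)
    num = math.sqrt(sum(e * e for e in errs))
    den = math.sqrt(sum((v - mean_y) ** 2 for v in observed))
    return 1.0 - num / den


def simulate(n, rng, fault_at=None):
    """Synthetic flow intensities at two checkpoints (illustrative only)."""
    x, y, prev = [], [], 0.0
    for t in range(n):
        u = 100.0 + 20.0 * rng.uniform(-1.0, 1.0)   # upstream intensity
        out = 0.5 * prev + 2.0 * u + rng.gauss(0.0, 1.0)
        if fault_at is not None and t >= fault_at:
            out *= 0.7                               # fault: 30% of flow lost
        x.append(u)
        y.append(out)
        prev = out
    return x, y


def demo():
    # Learn the relationship from fault-free data and validate it.
    x_train, y_train = simulate(300, random.Random(0))
    a, b = fit_arx(x_train, y_train)
    fit = fitness(x_train, y_train, a, b)
    # Assumed alarm rule: residual above 3x the largest training residual.
    train_errs = prediction_errors(x_train, y_train, a, b)
    threshold = 3.0 * max(abs(e) for e in train_errs)
    # Track new data containing a fault that starts at t = 120.
    x_test, y_test = simulate(200, random.Random(1), fault_at=120)
    test_errs = prediction_errors(x_test, y_test, a, b)
    alarms = [t for t, e in enumerate(test_errs, start=1) if abs(e) > threshold]
    return a, b, fit, alarms


if __name__ == "__main__":
    a, b, fit, alarms = demo()
    print("a =", round(a, 3), "b =", round(b, 3), "fitness =", round(fit, 3))
    print("first alarm at t =", alarms[0] if alarms else None)
```

In this sketch a pair of monitoring points would be kept only if its fitness score is high on fault-free data, and an alarm is raised when the residual between observed and predicted flow intensity exceeds the threshold. A real deployment would likely need higher model orders, delays, and a more carefully chosen threshold.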