As Internet services become more prevalent and more complex, there is a growing need to improve their operational reliability and availability. Although large amounts of monitoring data can be collected from these systems for fault analysis, it is hard to correlate the data effectively across distributed components and across observation time. In this paper, we analyze the mass characteristics of user requests and propose a novel approach that models and tracks transaction flow dynamics for fault detection in complex information systems. We measure flow intensity at multiple checkpoints inside the system and apply system identification methods to model the transaction flow dynamics between these measurements. Using the learned analytical models, we apply a model-based fault detection and isolation method that tracks the flow dynamics in real time. We also propose an algorithm that automatically searches for and validates dynamic relationships between randomly selected monitoring points, which gives systems a self-cognition capability for system management. We tested our approach on a real system with a set of injected faults, and the experimental results demonstrate its effectiveness.
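The workflow described above (learn a dynamic model between two flow-intensity measurements, validate it, then track residuals to detect faults) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the first-order model form y[t] = a*y[t-1] + b*x[t], the normalized fitness score used to validate a candidate pair, the threshold of three times the largest training residual, the synthetic data, and all function names.

```python
import math
import random


def simulate(n, a, b, rng, fault_start=None, loss=0.3):
    """Generate a hypothetical input/output flow-intensity pair.

    x[t] is the request rate at an upstream checkpoint; y[t] follows
    y[t] = a*y[t-1] + b*x[t] + noise. From fault_start on, a fraction
    `loss` of the input's contribution is dropped to mimic a fault.
    """
    x = [rng.uniform(50.0, 150.0) for _ in range(n)]
    y = [b * x[0] / (1.0 - a)]  # start near the steady-state level
    for t in range(1, n):
        faulty = fault_start is not None and t >= fault_start
        gain = b * (1.0 - loss) if faulty else b
        y.append(a * y[t - 1] + gain * x[t] + rng.uniform(-0.5, 0.5))
    return x, y


def fit_arx1(x, y):
    """Least-squares estimate of (a, b) in y[t] = a*y[t-1] + b*x[t]."""
    # Accumulate the 2x2 normal equations for regressors (y[t-1], x[t]).
    s_yy = s_yx = s_xx = r_y = r_x = 0.0
    for t in range(1, len(y)):
        p, q, target = y[t - 1], x[t], y[t]
        s_yy += p * p
        s_yx += p * q
        s_xx += q * q
        r_y += p * target
        r_x += q * target
    det = s_yy * s_xx - s_yx * s_yx
    if abs(det) < 1e-12:
        raise ValueError("regressors are collinear; cannot fit model")
    a = (r_y * s_xx - r_x * s_yx) / det
    b = (r_x * s_yy - r_y * s_yx) / det
    return a, b


def residuals(x, y, a, b):
    """One-step-ahead residuals y[t] - (a*y[t-1] + b*x[t]) for t >= 1."""
    return [y[t] - (a * y[t - 1] + b * x[t]) for t in range(1, len(y))]


def fitness(x, y, a, b):
    """Normalized fitness score in percent (100 means a perfect fit)."""
    res = residuals(x, y, a, b)
    mean_y = sum(y[1:]) / (len(y) - 1)
    err = math.sqrt(sum(r * r for r in res))
    spread = math.sqrt(sum((v - mean_y) ** 2 for v in y[1:]))
    return 100.0 * (1.0 - err / spread)


def detect(x, y, a, b, threshold):
    """Return the time indices whose residual magnitude exceeds threshold."""
    res = residuals(x, y, a, b)
    return [i + 1 for i, r in enumerate(res) if abs(r) > threshold]


# Learn the relationship from fault-free data (synthetic, for illustration).
rng = random.Random(0)
train_x, train_y = simulate(200, 0.5, 0.4, rng)
a_hat, b_hat = fit_arx1(train_x, train_y)
score = fitness(train_x, train_y, a_hat, b_hat)

# Assumed rule: alarm when a residual exceeds 3x the largest training residual.
threshold = 3.0 * max(abs(r) for r in residuals(train_x, train_y, a_hat, b_hat))

# Track a new trace in which a fault is injected at t = 60.
test_x, test_y = simulate(100, 0.5, 0.4, rng, fault_start=60)
alarms = detect(test_x, test_y, a_hat, b_hat, threshold)
print(f"a={a_hat:.3f} b={b_hat:.3f} fitness={score:.1f}%")
print("first alarm at t =", alarms[0] if alarms else None)
```

In this sketch, a candidate pair of monitoring points would be kept only if its fitness score on fault-free data is high. A residual that later exceeds the threshold indicates that the learned relationship no longer holds, which is the signal a model-based detector would act on.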