Quantitative system performance: computer system analysis using queueing network models
A guide to expert systems
Algorithms
SIGCOMM '88 Symposium proceedings on Communications architectures and protocols
Two Dimensional Time-Series for Anomaly Detection and Regulation in Adaptive Systems. In DSOM '02: Proceedings of the 13th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management: Management Technologies for E-Commerce and E-Business Applications
Dynamic dependencies and performance improvement. In LISA'08: Proceedings of the 22nd conference on Large installation system administration conference
ICMPv6 Cumulative Path Traceback in Mobile Ad Hoc networks (MANET). In Proceedings of the 2006 conference on Advances in Intelligent IT: Active Media Technology 2006
Application of anomaly detection algorithms for detecting SYN flooding attacks. Computer Communications
A real-time system-adapted anomaly detector. Information Sciences: an International Journal
Multi-site scheduling with multiple job reservations and forecasting methods. In ISPA'06: Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications
Selective resource characterization for evaluation of system dynamics. ACM SIGMETRICS Performance Evaluation Review
Performance troubleshooting in data centers: an annotated bibliography? ACM SIGOPS Operating Systems Review
Computer systems require monitoring to detect performance anomalies such as runaway processes, but problem detection and diagnosis is a complex task requiring skilled attention. Although human attention was never ideal for this task, as networks of computers grow larger and their interactions more complex, it falls far short. Existing computer-aided management systems require the administrator to manually specify fixed "trouble" thresholds. In this paper we report on an expert system that automatically sets thresholds, and detects and diagnoses performance problems on a network of Unix computers. Key to the success and scalability of this system are the time series models we developed to model the variations in workload on each host. Analysis of the load average records of 50 machines yielded models which show, for workstations with simulated problem injection, false positive and negative rates of less than 1%. The server machines most difficult to model still gave average false positive/negative rates of only 6%/32%. Observed values exceeding the expected range for a particular host cause the expert system to focus on that machine. There it applies tools with finer resolution and more discrimination, including per-command profiles gleaned from process accounting records. It makes one of 18 specific diagnoses and notifies the administrator, and optionally the user.
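The core idea in the abstract — fit a per-host model of normal load-average variation, then flag observations outside the model's expected range — can be sketched as follows. This is a minimal illustration using an exponentially weighted moving average (EWMA) of mean and variance as a stand-in predictor; the paper's actual per-host time series models, thresholds, and diagnosis logic are more sophisticated, and all names and parameters here are hypothetical.

```python
class LoadAnomalyDetector:
    """Per-host detector: tracks an EWMA estimate of the load average's
    mean and variance, and flags samples outside mean +/- k*sigma.
    (Illustrative stand-in for the paper's time series models.)"""

    def __init__(self, alpha=0.1, k=3.0):
        self.alpha = alpha   # smoothing factor for the EWMA updates
        self.k = k           # half-width of the expected range, in std devs
        self.mean = None     # running estimate of the typical load
        self.var = 0.0       # running estimate of its variance

    def update(self, load):
        """Feed one load-average sample; return True if it is anomalous."""
        if self.mean is None:            # first sample just seeds the model
            self.mean = load
            return False
        sigma = self.var ** 0.5
        anomalous = sigma > 0 and abs(load - self.mean) > self.k * sigma
        # Fold the new observation into the model.
        diff = load - self.mean
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return anomalous


# Warm the detector up on ordinary workstation load, then inject a spike
# such as a runaway process would produce.
detector = LoadAnomalyDetector()
normal = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1] * 5
flags = [detector.update(x) for x in normal]
runaway_flagged = detector.update(15.0)  # far outside the learned range
```

In the full system, a flag like `runaway_flagged` would not itself raise an alarm; it would trigger the finer-resolution tools (e.g. per-command profiles from process accounting) that narrow the event down to one of the specific diagnoses.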