Anomaly detection and diagnosis in grid environments

Authors:
Lingyun Yang;Chuang Liu;Jennifer M. Schopf;Ian Foster
Affiliations:
University of Chicago, Chicago, IL;Microsoft, Redmond, WA;Argonne National Laboratory, Argonne, IL;University of Chicago, Chicago, IL and Argonne National Laboratory, Argonne, IL
Venue:
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Year:
2007

Citing 15

Congestion avoidance and control
SIGCOMM '88 Symposium proceedings on Communications architectures and protocols

The scientist and engineer's guide to digital signal processing

A parallel workload model and its implications for processor allocation
Cluster Computing

A signal analysis of network traffic anomalies
Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurment

The Cactus Code: A Problem Solving Environment for the Grid
HPDC '00 Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing

Network traffic anomaly detection based on packet bytes
Proceedings of the 2003 ACM symposium on Applied computing

IP forwarding anomalies and improving their detection using multiple data sources
Proceedings of the ACM SIGCOMM workshop on Network troubleshooting: research, theory and operations practice meet malfunctioning reality

Aberrant Behavior Detection in Time Series for Network Monitoring
LISA '00 Proceedings of the 14th USENIX conference on System administration

An integrated experimental environment for distributed systems and networks
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation

Ensembles of Models for Automated Diagnosis of System Performance Problems
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks

Network Congestion Control: Managing Internet Traffic (Wiley Series on Communications Networking & Distributed Systems)

Statistical Data Reduction for Efficient Application Performance Monitoring
CCGRID '06 Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid

Probabilistic anomaly detection in distributed computer networks
Science of Computer Programming

Storage-based intrusion detection: watching storage activity for suspicious behavior
SSYM'03 Proceedings of the 12th conference on USENIX Security Symposium - Volume 12

Detecting performance anomalies in global applications
WORLDS'05 Proceedings of the 2nd conference on Real, Large Distributed Systems - Volume 2

Cited 3

Failure-aware workflow scheduling in cluster environments
Cluster Computing

Separating Performance Anomalies from Workload-Explained Failures in Streaming Servers
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)

Failure analysis of distributed scientific workflows executing in the cloud
Proceedings of the 8th International Conference on Network and Service Management

Abstract

Identifying and diagnosing anomalies in application behavior is critical to delivering reliable application-level performance. In this paper, we introduce a strategy to detect anomalies and diagnose the possible reasons behind them. Our approach extends the traditional window-based strategy by using signal-processing techniques to filter out recurring background fluctuations in resource behavior. In addition, we have developed a diagnosis technique that uses standard monitoring data to determine which related changes in behavior may cause anomalies. We evaluate our anomaly detection and diagnosis technique by applying it in three contexts in which we insert anomalies into the system at random intervals. The experimental results show that our strategy detects up to 96% of anomalies while reducing the false positive rate by up to 90% compared to the traditional window-average strategy. In addition, our strategy can diagnose the reason for the anomaly approximately 75% of the time.
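
The abstract does not give the algorithm itself, so the following is only a rough illustration of the general idea it describes: a window-average detector, and the same detector applied after recurring background fluctuations have been removed. Everything here is an assumption made for illustration, including the function names, the window size, the threshold `k`, the synthetic load trace, and the per-phase median subtraction, which is a crude stand-in for the paper's signal-processing filter and not the authors' method.

```python
from statistics import mean, median, pstdev


def window_anomalies(samples, window=8, k=1.5):
    """Flag index i when samples[i] deviates from the mean of the previous
    `window` samples by more than k standard deviations of that window."""
    flagged = []
    for i in range(window, len(samples)):
        recent = samples[i - window:i]
        mu = mean(recent)
        # Small floor so a perfectly flat window still yields a usable threshold.
        sigma = max(pstdev(recent), 1e-9)
        if abs(samples[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged


def remove_periodic(samples, period):
    """Subtract the median of each phase of an assumed, known period.
    Hypothetical stand-in for filtering recurring background fluctuations."""
    phase_medians = [median(samples[p::period]) for p in range(period)]
    return [x - phase_medians[i % period] for i, x in enumerate(samples)]


# Synthetic trace: a load spike recurs every 4th sample (background behavior),
# and one injected anomaly is added at index 21.
signal = [10, 10, 10, 30] * 10
signal[21] += 50

raw = window_anomalies(signal)
filtered = window_anomalies(remove_periodic(signal, period=4))
print("flagged on raw signal:     ", raw)
print("flagged on filtered signal:", filtered)
```

On this toy trace the plain window average also flags the recurring spikes, while the filtered version flags only the injected sample. That mirrors the false-positive reduction the abstract reports, but it says nothing about the actual percentages, which come from the authors' experiments.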