Anomaly management in grid environments

Authors:
Ian Foster;Lingyun Yang
Affiliations:
The University of Chicago;The University of Chicago
Venue:
Anomaly management in grid environments
Year:
2007

Citing 0
Cited 1

Log summarization and anomaly detection for troubleshooting distributed systems

GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing

Abstract

Recent experience in deploying Grid middleware has demonstrated the challenges of delivering robust service in distributed, shared environments. In particular, unexpected ("anomalous") variations in resource availability and performance can cause significant difficulties for applications that must deliver reliable performance to their users. Preventing, detecting, and diagnosing such anomalous behaviors, collectively known as anomaly management, is not a new concept and has been studied in many areas. However, the autonomy, heterogeneity, and dynamicity of Grid environments introduce particular difficulties, as do the complexities of the often tightly coupled applications executed in such environments. In this context, I hypothesize that by incorporating anomaly management mechanisms into Grid systems, we can allow end users to prevent, detect, and diagnose application-level anomalies in complex Grid environments. To evaluate this thesis, we study new challenges in three aspects of application anomaly management in Grid environments: (1) avoiding anomalies before they occur, (2) detecting application anomalies when they occur, and (3) diagnosing why anomalies occur once they are detected. We present novel techniques that address these challenges and evaluate each technique using real applications. Our experiments show that the new techniques can help users detect and diagnose the causes of performance anomalies. This information provides the data needed to achieve reliable application-level performance even when resource performance or availability changes during application execution.