Troubleshooting thousands of jobs on production grids using data mining techniques

Authors:
D. A. Cieslak;N. V. Chawla;D. L. Thain
Affiliations:
Dept. of Comput. Sci. & Eng., Univ. of Notre Dame, Notre Dame, IN; Dept. of Comput. Sci. & Eng., Univ. of Notre Dame, Notre Dame, IN; Dept. of Comput. Sci. & Eng., Univ. of Notre Dame, Notre Dame, IN
Venue:
GRID '08 Proceedings of the 2008 9th IEEE/ACM International Conference on Grid Computing
Year:
2008

Abstract

Large-scale production computing grids introduce new challenges in debugging and troubleshooting. A user who submits a workload consisting of tens of thousands of jobs to a grid of thousands of processors has a good chance of receiving thousands of error messages as a result. How can one begin to reason about such problems? We propose that data mining techniques can be employed to classify failures according to the properties of the jobs and machines involved. We demonstrate this technique through several case studies on real workloads consisting of tens of thousands of jobs. We apply the same techniques to a year's worth of data on a 3000-CPU production grid and use it to gain a high-level understanding of the system behavior.
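
The idea of classifying failures by job and machine properties can be illustrated with a minimal sketch. This is not the authors' method; the abstract does not name a specific algorithm. The sketch assumes information gain as a stand-in technique for ranking which property best separates failed jobs from successful ones. The record fields (`os`, `arch`, `outcome`) and the toy records are hypothetical and exist only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(records, attribute, label_key="outcome"):
    """Reduction in outcome entropy obtained by splitting on one attribute."""
    labels = [r[label_key] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[label_key])
    remainder = sum(
        len(g) / len(records) * entropy(g) for g in groups.values()
    )
    return entropy(labels) - remainder


def rank_attributes(records, attributes, label_key="outcome"):
    """Order attributes from most to least informative about the outcome."""
    return sorted(
        attributes,
        key=lambda a: information_gain(records, a, label_key),
        reverse=True,
    )


# Hypothetical toy job records (not data from the paper).
jobs = [
    {"os": "linux", "arch": "x86_64", "outcome": "ok"},
    {"os": "linux", "arch": "i386", "outcome": "ok"},
    {"os": "linux", "arch": "x86_64", "outcome": "ok"},
    {"os": "solaris", "arch": "x86_64", "outcome": "failed"},
    {"os": "solaris", "arch": "i386", "outcome": "failed"},
    {"os": "linux", "arch": "i386", "outcome": "ok"},
]

ranking = rank_attributes(jobs, ["os", "arch"])
print(ranking)
```

In this toy data every failure occurs on one operating system, so `os` ranks above `arch`. On a real workload the same ranking would point a user toward the machine or job property most associated with the error messages.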