A security architecture for computational grids
CCS '98 Proceedings of the 5th ACM conference on Computer and communications security
Bugs as deviant behavior: a general approach to inferring errors in systems code
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
A Resource Management Architecture for Metacomputing Systems
IPPS/SPDP '98 Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing
Pinpoint: Problem Determination in Large, Dynamic Internet Services
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Matchmaking: Distributed Resource Management for High Throughput Computing
HPDC '98 Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing
Condor-G: A Computation Management Agent for Multi-Institutional Grids
HPDC '01 Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing
Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Editorial: special issue on learning from imbalanced data sets
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
The Grid2003 Production Grid: Principles and Practice
HPDC '04 Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing
WAP5: black-box performance debugging for wide-area systems
Proceedings of the 15th international conference on World Wide Web
Mining for misconfigured machines in grid systems
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Grid user requirements--2004: a perspective from the trenches
Cluster Computing
Exploring event correlation for failure prediction in coalitions of clusters
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Log summarization and anomaly detection for troubleshooting distributed systems
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
On the dynamic resource availability in grids
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
Error detection and error classification: failure awareness in data transfer scheduling
International Journal of Autonomic Computing
Failure analysis of distributed scientific workflows executing in the cloud
Proceedings of the 8th International Conference on Network and Service Management
Characterizing workflow-based activity on a production e-infrastructure using provenance data
Future Generation Computer Systems
Hi-index | 0.00 |
Large scale production computing grids introduce new challenges in debugging and troubleshooting. A user that submits a workload consisting of tens of thousands of jobs to a grid of thousands of processors has a good chance of receiving thousands of error messages as a result. How can one begin to reason about such problems? We propose that data mining techniques can be employed to classify failures according to the properties of the jobs and machines involved. We demonstrate this technique through several case studies on real workloads consisting of tens of thousands of jobs. We apply the same techniques to a yearpsilas worth of data on a 3000 CPU production grid and use it to gain a high level understanding of the system behavior.