This work presents models that characterize failures observed during the execution of large scientific applications on Amazon EC2. Scientific workflows serve as the underlying abstraction for representing applications. As scientific workflows scale to hundreds of thousands of distinct tasks, failures due to software and hardware faults become increasingly common. We study job failure models using data collected from four scientific applications by our Stampede framework. In particular, we show that a Naive Bayes classifier can accurately predict the failure probability of jobs. The models allow us to predict job failures for a given execution resource and then use these predictions toward two higher-level goals: (1) suggesting a better job assignment, and (2) providing quantitative feedback to workflow component developers about the robustness of their application code.
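To make the prediction step concrete, the following is a minimal sketch of how a Naive Bayes classifier could estimate a job's failure probability on a given resource and how that estimate could guide job assignment. It is an illustration under stated assumptions, not the models or features used in this work: the feature names (`host`, `task`), the host and task labels, the toy job history, and the use of Laplace smoothing are all hypothetical.

```python
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # smoothing strength (assumed value)
        self.class_counts = Counter()             # label -> number of training jobs
        self.value_counts = defaultdict(Counter)  # (feature, label) -> value counts
        self.vocab = defaultdict(set)             # feature -> values seen in training

    def fit(self, rows, labels):
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for name, value in row.items():
                self.value_counts[(name, label)][value] += 1
                self.vocab[name].add(value)
        return self

    def predict_proba(self, row):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            score = n / total  # class prior P(label)
            for name, value in row.items():
                seen = self.value_counts[(name, label)][value]
                k = len(self.vocab[name]) + 1  # one extra slot for unseen values
                score *= (seen + self.alpha) / (n + self.alpha * k)
            scores[label] = score
        norm = sum(scores.values())
        return {label: s / norm for label, s in scores.items()}


# Hypothetical job history: (features, outcome). Not data from the study.
history = [
    ({"host": "vm-a", "task": "align"}, "failed"),
    ({"host": "vm-a", "task": "align"}, "failed"),
    ({"host": "vm-a", "task": "merge"}, "ok"),
    ({"host": "vm-b", "task": "align"}, "ok"),
    ({"host": "vm-b", "task": "merge"}, "ok"),
    ({"host": "vm-b", "task": "merge"}, "ok"),
]
model = CategoricalNaiveBayes().fit(
    [features for features, _ in history],
    [outcome for _, outcome in history],
)


def failure_probability(host, task):
    """Estimated probability that `task` fails when run on `host`."""
    return model.predict_proba({"host": host, "task": task}).get("failed", 0.0)


# Goal (1): suggest the candidate resource with the lowest predicted failure risk.
candidates = ["vm-a", "vm-b"]
best_host = min(candidates, key=lambda h: failure_probability(h, "align"))
```

The same per-task probabilities, aggregated over resources, could serve as the quantitative robustness feedback described in goal (2). With many features, summing log-probabilities instead of multiplying raw probabilities would avoid numerical underflow.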