Failure analysis of distributed scientific workflows executing in the cloud

Authors:
Taghrid Samak;Dan Gunter;Monte Goode;Ewa Deelman;Gideon Juve;Fabio Silva;Karan Vahi
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA;University of Southern California, Marina Del Rey, CA;University of Southern California, Marina Del Rey, CA;University of Southern California, Marina Del Rey, CA;University of Southern California, Marina Del Rey, CA
Venue:
Proceedings of the 8th International Conference on Network and Service Management
Year:
2012

Abstract

This work presents models that characterize failures observed during the execution of large scientific applications on Amazon EC2. Scientific workflows serve as the underlying abstraction for representing these applications. As scientific workflows scale to hundreds of thousands of distinct tasks, failures due to software and hardware faults become increasingly common. We study job failure models built from data that our Stampede framework collected from four scientific applications. In particular, we show that a Naive Bayes classifier can accurately predict the failure probability of jobs. The models let us predict job failures for a given execution resource and then use these predictions toward two higher-level goals: (1) suggesting a better job assignment, and (2) giving workflow component developers quantitative feedback about the robustness of their application codes.
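
To make the idea of predicting a job's failure probability with a Naive Bayes classifier concrete, the following is a minimal sketch, not the authors' implementation. It assumes each job is described by two categorical features, an execution host and a job type, and that each job carries a "failed" or "succeeded" label. The feature names, the toy records, and the `CategoricalNaiveBayes` class are all hypothetical and chosen only for illustration; the abstract does not state which features or data the actual models use.

```python
from collections import Counter, defaultdict
import math


class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with Laplace (add-one) smoothing."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        n_features = len(rows[0])
        # counts[(feature_index, label)][value] -> number of occurrences
        self.counts = defaultdict(Counter)
        # distinct values seen per feature, used in the smoothing denominator
        self.values = [set() for _ in range(n_features)]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def predict_proba(self, row):
        """Return a dict mapping each class label to its posterior probability."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for c in self.classes:
            score = math.log(self.class_counts[c] / total)  # log prior
            for i, value in enumerate(row):
                num = self.counts[(i, c)][value] + 1
                den = self.class_counts[c] + len(self.values[i])
                score += math.log(num / den)  # smoothed log likelihood
            log_scores[c] = score
        # normalize in log space for numerical stability
        top = max(log_scores.values())
        unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(unnorm.values())
        return {c: u / z for c, u in unnorm.items()}


# Hypothetical job records: (execution host, job type) -> outcome.
rows = [
    ("host-a", "transfer"), ("host-a", "compute"), ("host-a", "compute"),
    ("host-b", "transfer"), ("host-b", "transfer"), ("host-b", "compute"),
]
labels = ["succeeded", "succeeded", "succeeded", "failed", "failed", "succeeded"]

model = CategoricalNaiveBayes().fit(rows, labels)

# Predicted failure probability of a "transfer" job on each candidate host.
hosts = ["host-a", "host-b"]
p_fail = {h: model.predict_proba((h, "transfer"))["failed"] for h in hosts}

# Goal (1) from the abstract, in its simplest form: prefer the host with the
# lowest predicted failure probability for this job type.
best_host = min(hosts, key=p_fail.get)
print(p_fail, best_host)
```

In this toy data the only failures are transfer jobs on `host-b`, so the sketch assigns a higher failure probability to that host and recommends `host-a`. A real assignment policy would also weigh cost, load, and data locality, none of which are modeled here.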