Failure-aware workflow scheduling in cluster environments

Authors:
Zhifeng Yu; Chenjia Wang; Weisong Shi
Affiliations:
Wayne State University, Detroit, USA; Wayne State University, Detroit, USA; Wayne State University, Detroit, USA
Venue:
Cluster Computing
Year:
2010

Abstract

The goal of workflow application scheduling is to minimize the makespan of each workflow. Scheduling workflow applications in high-performance cluster environments is an NP-complete problem, and it becomes more complicated when potential resource failures are considered. Although failure prediction has received growing research attention in recent years as a way to improve system availability and reliability, very little of that work addresses the problem in the context of workflow application scheduling. In this paper, we study how a workflow scheduler benefits from failure prediction and propose FLAW, a failure-aware workflow scheduling algorithm. We observe that conventionally defined prediction accuracy has different performance implications for different applications and fails to measure how much prediction improves scheduling effectiveness. We therefore propose two definitions of accuracy, Application Oblivious Accuracy (AOA) and Application Aware Accuracy (AAA), from the perspectives of the system and of scheduling, respectively. A comprehensive evaluation using real failure traces shows that FLAW performs well with practically achievable prediction accuracy, reducing the average makespan, the loss time, and the number of job reschedulings.
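
The general idea of letting a scheduler consult failure predictions can be illustrated with a small sketch. This is not the FLAW algorithm, whose details the abstract does not give; it is a minimal greedy list scheduler written under stated assumptions. The function name `schedule`, the `predicted_failure` map (one predicted failure time per node, standing in for the output of some hypothetical predictor), and the rule "do not place a task on a node predicted to fail before the task would finish" are all illustrative choices, not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple


def schedule(
    tasks: List[str],
    deps: Dict[str, List[str]],
    runtime: Dict[str, float],
    nodes: List[str],
    predicted_failure: Optional[Dict[str, float]] = None,
) -> Tuple[Dict[str, str], float]:
    """Greedy earliest-finish-time scheduling of a workflow DAG.

    tasks must be given in topological order. predicted_failure maps a
    node to the time at which a (hypothetical) predictor expects it to
    fail; nodes missing from the map are assumed to stay healthy.
    Returns the task-to-node placement and the planned makespan.
    """
    predicted_failure = predicted_failure or {}
    node_free = {n: 0.0 for n in nodes}   # time at which each node is idle
    finish: Dict[str, float] = {}         # planned finish time per task
    placement: Dict[str, str] = {}

    for t in tasks:
        # A task is ready once all of its predecessors have finished.
        ready = max((finish[p] for p in deps.get(t, [])), default=0.0)
        best: Optional[Tuple[str, float]] = None
        for n in nodes:
            start = max(ready, node_free[n])
            end = start + runtime[t]
            fail_at = predicted_failure.get(n)
            # Assumed rule: skip a node predicted to fail before the
            # task would complete there.
            if fail_at is not None and end > fail_at:
                continue
            if best is None or end < best[1]:
                best = (n, end)
        if best is None:
            raise RuntimeError(f"no node can safely run task {t!r}")
        node, end = best
        node_free[node] = end
        finish[t] = end
        placement[t] = node

    return placement, max(finish.values())
```

With tasks a (4 time units), b (2), and c (3, dependent on b) on two nodes, the failure-oblivious plan runs c on the second node for a planned makespan of 5. If the second node is predicted to fail at time 4, the sketch moves c to the first node and the planned makespan grows to 7. The plan is longer, but it avoids the lost work and rescheduling that would follow if the prediction is correct, which is the kind of trade-off whose value depends on prediction accuracy.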