With the advent of next-generation scientific applications, the workflow approach, which integrates various computing and networking technologies, has provided a viable solution for managing and optimizing large-scale distributed data transfer, processing, and analysis. This paper investigates the problem of mapping distributed scientific workflows for maximum throughput in faulty networks, where nodes and links are subject to probabilistic failures. We formulate this problem as a bi-objective optimization that maximizes both throughput and reliability. By adapting and modifying a centralized fault-free workflow mapping scheme, we propose a new mapping algorithm that achieves high throughput for smooth data flow in a distributed manner while satisfying a pre-specified bound on the overall failure rate, thereby guaranteeing a given level of reliability. The performance superiority of the proposed solution is illustrated both by extensive simulation-based comparisons with existing algorithms and by experimental results from a real-life scientific workflow deployed in wide-area networks.
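To make the bi-objective formulation concrete, the following is a minimal sketch (not the paper's algorithm) of how a single candidate mapping might be evaluated under the stated model: throughput of a pipelined workflow is limited by its slowest stage, reliability is the product of per-node success probabilities, and a mapping is feasible only if the overall failure rate stays within the pre-specified bound. The function name, inputs, and the purely multiplicative reliability model are illustrative assumptions.

```python
def evaluate_mapping(stage_rates, node_fail_probs, max_failure_rate):
    """Evaluate one candidate workflow-to-network mapping (illustrative model).

    stage_rates[i]     -- data-processing rate of stage i on its assigned node
    node_fail_probs[i] -- failure probability of the node hosting stage i
    max_failure_rate   -- pre-specified bound on the overall failure rate
    """
    # Throughput of a pipeline is determined by the bottleneck stage.
    throughput = min(stage_rates)

    # Assume independent failures: the workflow succeeds only if every
    # hosting node survives, so reliability is the product of successes.
    reliability = 1.0
    for p in node_fail_probs:
        reliability *= (1.0 - p)

    # Feasible iff the overall failure rate meets the reliability bound.
    feasible = (1.0 - reliability) <= max_failure_rate
    return throughput, reliability, feasible

# Example: a three-stage workflow with a 10% overall failure-rate bound.
tp, rel, ok = evaluate_mapping([5.0, 3.0, 4.0], [0.01, 0.02, 0.01], 0.10)
```

A mapping algorithm of the kind described would search over candidate assignments, keeping only those for which `feasible` holds and maximizing `throughput` among them.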