Performance under Failures of DAG-based Parallel Computing

Authors:
Hui Jin;Xian-He Sun;Ziming Zheng;Zhiling Lan;Bing Xie
Venue:
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Year:
2009

Abstract

As parallel systems grow in scale and complexity, failures become unavoidable when running large-scale applications. We present an analytical study that estimates the execution time of directed acyclic graph (DAG) based scientific applications in the presence of failures and provides guidelines for performance optimization. The study is fourfold. First, we introduce a performance model that predicts the computation time of an individual subtask under failures. Second, a layered, iterative approach transforms the DAG into a layered DAG that reflects the full dependencies among all subtasks. Third, the expected execution time of the DAG under failures is derived by stochastic analysis; unlike existing models, the proposed model also provides both the variance and the distribution of the execution time, which makes it practical to apply. Finally, based on the model, we propose performance optimization, weak point identification, and weak point enhancement. Extensive simulations with real system traces verify the analytical findings and show that the proposed model and the weak point enhancement mechanism work well.
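
The abstract does not spell out how the DAG is transformed into a layered DAG. As a rough illustration only, the sketch below assumes one common layering rule: a subtask's layer is the length of the longest dependency path leading to it, so every subtask sits strictly below all of its predecessors. The function name `layer_dag` and the edge-list input format are hypothetical and are not taken from the paper.

```python
from collections import defaultdict, deque


def layer_dag(edges, nodes=None):
    """Group subtasks into layers by longest-path depth (an assumed rule,
    not necessarily the paper's layered, iterative approach).

    edges: iterable of (predecessor, successor) pairs.
    nodes: optional iterable of extra subtasks that have no edges.
    Returns a list of layers; each layer is a sorted list of subtasks.
    """
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    all_nodes = set(nodes or [])
    for u, v in edges:
        successors[u].append(v)
        in_degree[v] += 1
        all_nodes.update((u, v))

    # Kahn-style traversal: a subtask becomes ready once all of its
    # predecessors have been processed.
    depth = {n: 0 for n in all_nodes}
    ready = deque(sorted(n for n in all_nodes if in_degree[n] == 0))
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in successors[u]:
            # Place v at least one layer below each predecessor.
            depth[v] = max(depth[v], depth[u] + 1)
            in_degree[v] -= 1
            if in_degree[v] == 0:
                ready.append(v)

    if visited != len(all_nodes):
        raise ValueError("graph contains a cycle, so it is not a DAG")

    layers = defaultdict(list)
    for n, d in depth.items():
        layers[d].append(n)
    return [sorted(layers[d]) for d in range(len(layers))]
```

For example, `layer_dag([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("a", "d")])` returns `[["a"], ["b", "c"], ["d"]]`: the direct edge from `a` to `d` does not pull `d` up, because `d` also depends on `b` and `c`.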