As the scale and complexity of parallel systems continue to grow, failures become an inevitable part of running large-scale applications. In this research, we present an analytical study that estimates the execution time, in the presence of failures, of scientific applications structured as directed acyclic graphs (DAGs), and we provide guidelines for performance optimization. The study is fourfold. First, we introduce a performance model to predict the computation time of an individual subtask under failures. Next, a layered, iterative approach transforms a DAG into a layered DAG that reflects the full dependencies among all subtasks. Then, the expected execution time of the DAG under failures is derived through stochastic analysis. Unlike existing models, the proposed performance model provides both the variance and the distribution of the execution time. It is practical and can be put to real use. Finally, based on the model, we propose performance optimization, weak point identification, and weak point enhancement. Extensive simulations with real system traces are conducted to verify the analytical findings; they show that the proposed model and the weak point enhancement mechanism work well.
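The abstract does not spell out how the DAG is layered, so the following is only a minimal sketch of one common way to do it, not the authors' procedure. It assumes that each subtask is placed in the earliest layer that comes after all of its predecessors (longest-path layering), computed with a topological traversal. The function name `layer_dag` and the example subtask names are hypothetical.

```python
from collections import defaultdict, deque


def layer_dag(nodes, edges):
    """Group DAG nodes into layers (assumed longest-path layering).

    nodes: iterable of hashable subtask identifiers.
    edges: iterable of (u, v) pairs meaning "v depends on u".
    Returns a list of layers; each layer is a sorted list of nodes whose
    predecessors all lie in earlier layers.
    """
    nodes = list(nodes)
    successors = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Subtasks without predecessors form layer 0.
    layer = {n: 0 for n in nodes}
    queue = deque(n for n in nodes if indegree[n] == 0)
    processed = 0

    while queue:
        u = queue.popleft()
        processed += 1
        for v in successors[u]:
            # v must sit at least one layer below every predecessor.
            layer[v] = max(layer[v], layer[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    if processed != len(nodes):
        raise ValueError("graph contains a cycle; it is not a DAG")

    depth = max(layer.values(), default=-1) + 1
    layers = [[] for _ in range(depth)]
    for n in nodes:
        layers[layer[n]].append(n)
    return [sorted(group) for group in layers]


# Hypothetical five-subtask workflow: D waits for A, B, C, and E.
example_nodes = ["A", "B", "C", "D", "E"]
example_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("E", "D")]
example_layers = layer_dag(example_nodes, example_edges)
```

With this layering, the subtasks inside one layer have no dependencies on each other, so a layer finishes when its slowest subtask finishes. That is the property a per-layer stochastic analysis of execution time would rely on, under the assumptions stated above.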