Modern high-end computers are unprecedentedly complex, and faults are inevitable when running large-scale applications on future petaflop machines. Many methods have been proposed in recent years to mask faults. These methods, however, impose various performance and production costs. A better understanding of how faults affect application performance is necessary to use existing fault-tolerance methods wisely. In this study, we first introduce practical and effective performance models that predict application completion time under system failures. These models separate the influence of failure rate, failure repair, checkpointing period, checkpointing cost, and parallel task allocation on parallel and sequential execution times. To benefit the end users of a given computing platform, we then develop fault-aware task scheduling algorithms that optimize application performance under system failures. Finally, we conduct extensive simulations and experiments to evaluate our prediction models and scheduling strategies using actual failure traces.
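To make the quantities named above concrete, the following is a minimal simulation sketch of how failure rate, repair time, checkpointing period, and checkpointing cost can combine into a completion time for a single sequential task. It is not the model developed in this study. It assumes exponentially distributed times between failures, a fixed repair time during which no further failure occurs, and loss of all work since the last checkpoint. The function and parameter names are hypothetical. The helper `young_interval` uses Young's well-known first-order approximation of the optimum checkpoint interval, sqrt(2 * C * MTBF), only as a convenient starting point.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimum checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def simulate_completion_time(work, interval, checkpoint_cost,
                             repair_time, mtbf, rng):
    """Simulate one run and return its wall-clock completion time.

    Assumptions (illustrative, not taken from the paper):
    - times between failures are exponential with mean `mtbf`;
    - a failure during a compute segment or its checkpoint loses that segment;
    - repair takes a fixed `repair_time` and cannot itself fail;
    - the final segment is not followed by a checkpoint.
    """
    clock = 0.0
    saved = 0.0  # work already protected by a checkpoint
    if math.isfinite(mtbf):
        next_failure = rng.expovariate(1.0 / mtbf)
    else:
        next_failure = math.inf  # failure-free baseline

    while saved < work:
        segment = min(interval, work - saved)
        is_last = saved + segment >= work
        cost = segment + (0.0 if is_last else checkpoint_cost)
        if clock + cost <= next_failure:
            # Segment and checkpoint finished before the next failure.
            clock += cost
            saved += segment
        else:
            # Failure: roll back to the last checkpoint, then repair.
            clock = next_failure + repair_time
            next_failure = clock + rng.expovariate(1.0 / mtbf)
    return clock


def mean_completion_time(runs, seed, **params):
    """Average completion time over `runs` independent simulated runs."""
    rng = random.Random(seed)
    total = sum(simulate_completion_time(rng=rng, **params)
                for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    tau = young_interval(checkpoint_cost=1.0, mtbf=50.0)
    avg = mean_completion_time(1000, seed=0, work=100.0, interval=tau,
                               checkpoint_cost=1.0, repair_time=5.0,
                               mtbf=50.0)
    print(f"interval={tau:.1f}, mean completion time={avg:.1f}")
```

Sweeping `interval` or `mtbf` in a sketch like this shows the trade-off between checkpointing too often, which pays the checkpoint cost repeatedly, and too rarely, which loses more work per failure. A parallel version would also need a task allocation policy, which this sketch omits.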