Performance under failures of high-end computing

Authors:
Ming Wu;Xian-He Sun;Hui Jin
Affiliations:
Illinois Institute of Technology, Chicago, Illinois;Illinois Institute of Technology, Chicago, Illinois and Fermi National Accelerator Laboratory, Batavia, Illinois;Illinois Institute of Technology, Chicago, Illinois
Venue:
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Year:
2007

Abstract

Modern high-end computers are unprecedentedly complex. Faults are inevitable when solving large-scale applications on future petaflop machines. Many methods have been proposed in recent years to mask faults. These methods, however, impose various performance and production costs. A better understanding of how faults influence application performance is necessary to use existing fault-tolerant methods wisely. In this study, we first introduce practical and effective performance models that predict application completion time under system failures. These models separate the influence of failure rate, failure repair, checkpointing period, checkpointing cost, and parallel task allocation on parallel and sequential execution times. To benefit the end users of a given computing platform, we then develop effective fault-aware task scheduling algorithms that optimize application performance under system failures. Finally, we conduct extensive simulations and experiments to evaluate our prediction models and scheduling strategies using actual failure traces.
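To make the quantities named in the abstract concrete, the sketch below shows how failure rate, repair time, checkpointing period, and checkpointing cost can combine into a completion-time estimate. It is not the authors' model, which this record does not give. It is a standard textbook approximation that assumes exponentially distributed failures, no failures during repair, and a restart from the last checkpoint, with the checkpoint interval chosen by Young's first-order formula. The function names and parameters are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimum checkpoint interval.

    checkpoint_cost: time to write one checkpoint (seconds)
    mtbf: mean time between failures (seconds)
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def expected_completion_time(work, interval, checkpoint_cost, mtbf, repair_time):
    """Expected wall-clock time for `work` seconds of failure-free computation.

    Assumptions (illustrative, not taken from the paper):
      - failures arrive as a Poisson process with rate 1 / mtbf
      - each failure costs `repair_time`, then the current segment restarts
      - no failure occurs during repair
      - a checkpoint of cost `checkpoint_cost` follows every `interval`
        seconds of work
    """
    rate = 1.0 / mtbf
    segment = interval + checkpoint_cost   # work plus checkpoint per segment
    segments = work / interval             # fractional count, an approximation
    # Expected time to finish one segment under exponential failures.
    per_segment = (mtbf + repair_time) * math.expm1(rate * segment)
    return segments * per_segment


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the functions.
    cost, mtbf, repair, work = 50.0, 10_000.0, 100.0, 100_000.0
    tau = young_interval(cost, mtbf)
    print("checkpoint interval:", tau)
    print("expected completion:", expected_completion_time(work, tau, cost, mtbf, repair))
```

Under these assumptions, checkpointing too often wastes time on checkpoint overhead, while checkpointing too rarely increases the work lost per failure. Young's interval sits between the two extremes.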