Using queue structures to improve job reliability

Authors:
Thomas J. Hacker;Zdzislaw Meglicki
Affiliations:
Purdue University;Indiana University
Venue:
Proceedings of the 16th international symposium on High performance distributed computing
Year:
2007

Citing 11
Cited 3

Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme

IEEE Transactions on Computers
Improving cluster availability using workstation validation

SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Failure Data Analysis of a LAN of Windows NT Based Computers

SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
The workload on parallel supercomputers: modeling the characteristics of rigid jobs

Journal of Parallel and Distributed Computing
Pastiche: making backup cheap and easy

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Automatic methods for predicting machine availability in desktop Grid and peer-to-peer systems

CCGRID '04 Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Failure trends in a large disk drive population

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Performance implications of failures in large-scale cluster scheduling

JSSPP'04 Proceedings of the 10th international conference on Job Scheduling Strategies for Parallel Processing
Modeling machine availability in enterprise and wide-area distributed computing environments

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

An analysis of clustered failures on large supercomputing systems

Journal of Parallel and Distributed Computing
Flexible resource allocation for reliable virtual cluster computing systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
A reliability model for cloud computing for high performance computing applications

Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops

Abstract

Many high-performance computing (HPC) systems today exploit the availability and remarkable performance characteristics of stand-alone server systems and the impressive price/performance ratio of commodity components. Small-scale HPC systems, in the range of 16 to 64 processors, have enjoyed significant popularity and are an indispensable tool for the research community. Scaling up to hundreds and thousands of processors, however, has exposed operational issues, including system availability and reliability. In this paper, we explore the impact of individual component reliability rates on the overall reliability of an HPC system. We derive a mathematical model for determining the failure rate of the system and the probability of failure of a job running on a subset of the system, and we show how to design a reasonable queue structure to provide a reliable system over a broad job mix. We also explore the impact of reliability and queue structure on checkpoint intervals and recovery. Our results demonstrate that it is possible to design a reliable high-performance computing system with very good operational reliability characteristics from a collection of moderately reliable components.
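
The abstract does not state the paper's model, so the following is only a minimal sketch of the kind of calculation it describes, under assumptions that are not taken from the paper: nodes fail independently with a constant (exponential) failure rate, a job fails if any node it uses fails, and the checkpoint interval follows Young's well-known first-order approximation. All function names and the numeric values are hypothetical.

```python
import math


def system_failure_rate(node_rate_per_hour: float, nodes: int) -> float:
    """Aggregate failure rate of `nodes` independent nodes (assumed model)."""
    return node_rate_per_hour * nodes


def job_failure_probability(node_rate_per_hour: float,
                            nodes_used: int,
                            runtime_hours: float) -> float:
    """Probability that at least one of the job's nodes fails during the run.

    Assumes independent exponential failures; this is an illustrative
    assumption, not the model derived in the paper.
    """
    rate = system_failure_rate(node_rate_per_hour, nodes_used)
    return 1.0 - math.exp(-rate * runtime_hours)


def young_checkpoint_interval(checkpoint_cost_hours: float,
                              node_rate_per_hour: float,
                              nodes_used: int) -> float:
    """Young's approximation: interval = sqrt(2 * checkpoint cost * MTBF)."""
    mtbf_hours = 1.0 / system_failure_rate(node_rate_per_hour, nodes_used)
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    node_mtbf_hours = 10_000.0        # hypothetical per-node MTBF
    rate = 1.0 / node_mtbf_hours
    for n in (16, 64, 1024):          # hypothetical queue widths
        p = job_failure_probability(rate, n, runtime_hours=24.0)
        t = young_checkpoint_interval(0.1, rate, n)
        print(f"{n:5d} nodes: P(fail in 24 h) = {p:.3f}, "
              f"checkpoint every {t:.2f} h")
```

Under these assumptions, the failure probability of a job grows with both its width and its runtime, which is why a queue structure that bounds node count and wall-clock time per queue can bound the failure probability per job.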