Towards highly available and scalable high performance clusters

Authors:
Azzedine Boukerche;Raed A. Al-Shaikh;Mirela Sechi Moretti Annoni Notare
Affiliations:
Paradise Research Laboratory, Site, University of Ottawa, Canada;Paradise Research Laboratory, Site, University of Ottawa, Canada and EXPEC Computer Center (ECC), Saudi Aramco, Saudi Arabia;Barddal University, Brazil
Venue:
Journal of Computer and System Sciences
Year:
2007

Abstract

In recent years, we have witnessed a growing interest in high performance computing (HPC) using clusters of workstations. This growth has made it affordable for individuals to have exclusive access to their own supercomputers. However, one of the challenges in a clustered environment is to keep system failures to a minimum and to achieve the highest possible level of system availability. High-availability (HA) computing attempts to avoid the problems of unexpected failures through active redundancy and preemptive measures. Since the prices of hardware components are dropping significantly, we propose to combine the HPC and HA concepts and lay out the design of an HA-HPC cluster, considering all possible measures. In particular, we explore the hardware and management layers of the HA-HPC cluster design, and we study the parallel-applications layer (i.e., FT-MPI implementations) in more detail. Our findings show that combining the HPC and HA architectures is feasible, resulting in a highly available cluster that can be used for high performance computing.