Highly reliable Linux HPC clusters: self-awareness approach

Authors:
Chokchai Leangsuksun; Tong Liu; Yudan Liu; Stephen L. Scott; Richard Libby; Ibrahim Haddad
Affiliations:
Computer Science Department, Louisiana Tech University; Enterprise Platforms Group, Dell Corp.; Computer Science Department, Louisiana Tech University; Oak Ridge National Laboratory; Intel Corporation; Ericsson Research
Venue:
ISPA'04 Proceedings of the Second international conference on Parallel and Distributed Processing and Applications
Year:
2004

Abstract

Current solutions for fault tolerance in HPC systems focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration changes caused by transient failures and require a complete restart of the entire machine. The recently released HA-OSCAR software stack is one effort making inroads on this problem. This paper discusses detailed solutions for enhancing the high availability and serviceability of clusters with HA-OSCAR through multi-head-node failover and a service-level fault-tolerance mechanism. Our solution employs self-configuration and introduces Adaptive Self Healing (ASH) techniques. An analysis of the HA-OSCAR availability improvement was also conducted with various sensitivity factors. Finally, the paper details the system layering strategy, dependability modeling, and the analysis of an actual experimental system using a Petri net-based model, the Stochastic Reward Net (SRN).