Availability Modeling and Analysis on High Performance Cluster Computing Systems

Authors:
Hertong Song;Chokchai "box" Leangsuksun;Raja Nassar
Affiliations:
Louisiana Tech University;Louisiana Tech University;Louisiana Tech University
Venue:
ARES '06 Proceedings of the First International Conference on Availability, Reliability and Security
Year:
2006

Citing 0
Cited 7

Proactive fault tolerance for HPC with Xen virtualization
Proceedings of the 21st annual international conference on Supercomputing

Exploring event correlation for failure prediction in coalitions of clusters
Proceedings of the 2007 ACM/IEEE conference on Supercomputing

Proactive process-level live migration in HPC environments
Proceedings of the 2008 ACM/IEEE conference on Supercomputing

Towards resilient high performance applications through real time reliability metric generation and autonomous failure correction
Proceedings of the 2009 workshop on Resiliency in high performance

Towards pro-active adaptation with confidence: augmenting service monitoring with online testing
Proceedings of the 2010 ICSE Workshop on Software Engineering for Adaptive and Self-Managing Systems

Proactive process-level live migration and back migration in HPC environments
Journal of Parallel and Distributed Computing

HADAB: enabling fault tolerance in parallel applications running in distributed environments
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I

Abstract

Cluster computing has been attracting increasing attention from both industry and academia for its enormous computing power, cost effectiveness, and scalability. Availability is a key system attribute that must be considered at the system design stage and must also reflect how the system actually behaves in operation. System monitoring and logging enable the identification of unplanned events, so that the actual availability of the system can be determined. This paper proposes a single framework that coordinates event monitoring, filtering, data analysis, and dynamic availability modeling. The availability model is abstracted and categorized based on functionality. We describe the proposed architecture and a sample analysis of real-time event logs from a 512-node cluster at Lawrence Livermore National Laboratory.
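
To make the idea of deriving availability from logged events concrete, the following is a minimal sketch and not the framework described in the paper. It assumes a hypothetical `Outage` record (node name, start and end time in hours, and a planned/unplanned flag), filters out planned events, and computes the observed availability as the fraction of node-hours not lost to unplanned outages. It also derives the standard steady-state estimate MTTF / (MTTF + MTTR) from the same data. The record layout, the function names, and the node-hour definition of availability are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """Hypothetical log record; not the log format used in the paper."""
    node: str
    start: float    # hours since the start of the observation window
    end: float      # hours since the start of the observation window
    planned: bool   # True for scheduled maintenance


def unplanned_only(outages):
    """Filtering step: keep only unplanned events."""
    return [o for o in outages if not o.planned]


def observed_availability(outages, n_nodes, window_hours):
    """Fraction of node-hours not lost to unplanned outages.

    Assumes outages lie inside the window and do not overlap on a node.
    """
    lost = sum(o.end - o.start for o in unplanned_only(outages))
    return 1.0 - lost / (n_nodes * window_hours)


def mttf_mttr(outages, n_nodes, window_hours):
    """Estimate mean time to failure and mean time to repair, in hours."""
    events = unplanned_only(outages)
    if not events:
        return None
    down = sum(o.end - o.start for o in events)
    up = n_nodes * window_hours - down
    return up / len(events), down / len(events)


# Made-up sample data for a 4-node system observed for 100 hours.
sample = [
    Outage("n1", 10.0, 12.0, planned=False),
    Outage("n2", 50.0, 56.0, planned=False),
    Outage("n1", 80.0, 84.0, planned=True),   # ignored by the filter
]
availability = observed_availability(sample, n_nodes=4, window_hours=100.0)
mttf, mttr = mttf_mttr(sample, n_nodes=4, window_hours=100.0)
steady_state = mttf / (mttf + mttr)
```

With this definition, `availability` and `steady_state` agree, because both reduce to uptime divided by total node-hours. A model categorized by functionality, as the abstract describes, would need a different notion of "up" than this per-node count.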