Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems

  • Authors:
  • Ana Gainaru;Franck Cappello;William Kramer

  • Affiliations:
  • -;-;-

  • Venue:
  • IPDPS '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

HPC systems are complex machines that generate a huge volume of system state data called âeventsâ. Events are generated without following a general consistent rule and different hardware and software components of such systems have different failure rates. Distinguishing between normal system behaviour and faulty situation relies on event analysis. Being able to detect quickly deviations from normality is essential for system administration and is the foundation of fault prediction. As HPC systems continue to grow in size and complexity, mining event flows become more challenging and with the upcoming 10 Pet flop systems, there is a lot of interestin this topic. Current event mining approaches do not take into consideration the specific behaviour of each type of events and as a consequence, fail to analyze them according to their characteristics. In this paper we propose a novel way of characterizing the normal and faulty behaviour of the system by using signal analysis concepts. All analysis modules create ELSA (Event Log Signal Analyzer), a toolkit that has the purpose of modelling the normal flow of each state event during a HPC system lifetime, and how it is affected when a failure hits the system. We show that these extracted models provide an accurate view of the system output, which improves the effectiveness of proactive fault tolerance algorithms. Specifically, we implemented a filtering algorithm and short-term fault prediction methodology based on the extracted model and test it against real failure traces from a large-scale system. We show that by analyzing each event according to its specific behaviour, we get a more realistic overview of the entire system.