Simulating application resilience at exascale

  • Authors:
  • Rolf Riesen;Kurt B. Ferreira;Maria Ruiz Varela;Michela Taufer;Arun Rodrigues

  • Affiliations:
  • IBM Research, Ireland;Sandia National Laboratories, Albuquerque, NM;University of Delaware;University of Delaware;Sandia National Laboratories, Albuquerque, NM

  • Venue:
  • Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
  • Year:
  • 2011

Abstract

The reliability mechanisms for future exascale systems will be a key aspect of their scalability and performance. With the expected jump in hardware component counts, faults will become more common than they are on today's systems. Under these circumstances, the costs of current and emergent resilience methods need to be reevaluated. This includes the cost of recovery, which is often ignored in current work, and the impact of hardware features such as heterogeneous computing elements and non-volatile memory devices. We describe a simulation and modeling framework that enables the evaluation of various resilience algorithms under varying application characteristics. For this framework we outline the simulator's requirements, its application communication pattern generators, and a few of the key hardware component models.