Fault resilience of the algebraic multi-grid solver

  • Authors:
  • Marc Casas;Bronis R. de Supinski;Greg Bronevetsky;Martin Schulz

  • Affiliations:
  • Lawrence Livermore National Laboratory, Livermore, USA;Lawrence Livermore National Laboratory, Livermore, USA;Lawrence Livermore National Laboratory, Livermore, USA;Lawrence Livermore National Laboratory, Livermore, USA

  • Venue:
  • Proceedings of the 26th ACM international conference on Supercomputing
  • Year:
  • 2012

Abstract

As HPC system sizes grow to millions of cores and chip feature sizes continue to decrease, HPC applications become increasingly exposed to transient hardware faults. These faults can cause aborts and performance degradation. Most importantly, they can corrupt results. Thus, we must evaluate the fault vulnerability of key HPC algorithms to develop cost-effective techniques to improve application resilience. We present an approach that analyzes the vulnerability of applications to faults, systematically reduces it by protecting the most vulnerable components, and predicts application vulnerability at large scales. We initially focus on sparse scientific applications and apply our approach in this paper to the Algebraic Multi Grid (AMG) algorithm. We empirically analyze AMG's vulnerability to hardware faults in both sequential and parallel (hybrid MPI/OpenMP) executions on up to 1,600 cores, and we propose and evaluate the use of targeted pointer replication to reduce it. Our techniques increase AMG's resilience to transient hardware faults by 50-80% and improve its scalability in faulty computational environments by 35%. Further, we show how to model AMG's scalability in fault-prone environments to predict execution times of large-scale runs accurately.
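
The abstract does not describe how the pointer replication is implemented, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a sparse matrix in CSR form whose index arrays (`row_ptr`, `col_idx`) play the role of pointers, keeps three copies of each index, and uses a majority vote on every read. The names `vote`, `ReplicatedIndex`, and `csr_matvec` are hypothetical.

```python
from collections import Counter


def vote(copies):
    """Return the majority value among replicas; raise if there is no majority."""
    value, count = Counter(copies).most_common(1)[0]
    if count * 2 <= len(copies):
        raise RuntimeError("no majority among replicas")
    return value


class ReplicatedIndex:
    """Hypothetical helper: an integer index array stored in three copies."""

    def __init__(self, values):
        self.copies = [list(values) for _ in range(3)]

    def __getitem__(self, i):
        good = vote([c[i] for c in self.copies])
        for c in self.copies:  # repair any replica that disagrees
            c[i] = good
        return good


def csr_matvec(n_rows, row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a CSR matrix; index arrays may be replicated."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


# A = [[2, 0], [1, 3]] in CSR form, with only the index arrays protected.
row_ptr = ReplicatedIndex([0, 1, 3])
col_idx = ReplicatedIndex([0, 0, 1])
vals = [2.0, 1.0, 3.0]

# Simulate a transient fault: flip one bit in a single replica of col_idx.
col_idx.copies[1][2] ^= 1 << 4

y = csr_matvec(2, row_ptr, col_idx, vals, [1.0, 1.0])
```

In this sketch only the index data is replicated, on the assumption that a corrupted index can cause an out-of-bounds access or an abort, whereas a corrupted floating-point value in `vals` is left unprotected. Which structures to protect in a real solver is exactly the kind of question the vulnerability analysis described above is meant to answer.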