Poster: a tunable, software-based DRAM error detection and correction library for HPC

  • Authors:
  • David Fiala;Kurt Ferreira;Frank Mueller;Christian Engelmann

  • Affiliations:
  • North Carolina State University, Raleigh, NC, USA;Sandia National Laboratories, Albuquerque, NM, USA;North Carolina State University, Raleigh, NC, USA;Oak Ridge National Laboratory, Oak Ridge, TN, USA

  • Venue:
  • Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion
  • Year:
  • 2011

Abstract

Proposed exascale systems will present considerable resiliency challenges. In particular, DRAM soft errors, or bit flips, are expected to increase greatly because of the higher memory density of these systems. Current hardware-based fault-tolerance methods will be unable to cope with the expected soft-error rate, so additional software will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent library for detecting and correcting silent data corruption (SDC) in HPC applications. LIBSDC provides comprehensive SDC protection for program memory by using the MMU to verify page integrity on demand. Experimental benchmarks with Mantevo HPCCG show that, once tuned, LIBSDC achieves SDC protection with less than 100% resource overhead.
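The idea of on-demand page integrity verification can be illustrated with a toy model. The sketch below is not LIBSDC's implementation or API: it is a minimal user-space simulation in Python, in which an explicit `locked` flag stands in for MMU page protection and the `_on_access` method stands in for a page-fault handler. A real implementation would presumably rely on hardware page protection (for example `mprotect` with a fault handler on POSIX systems), which the abstract does not detail. The class `ProtectedMemory`, its methods, the 4096-byte page size, and the choice of CRC32 are all assumptions made for illustration, and the sketch covers detection only, not correction.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this toy model


class CorruptionDetected(Exception):
    """Raised when a page no longer matches its stored checksum."""


class ProtectedMemory:
    """Toy model of on-demand page verification (hypothetical, not LIBSDC)."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.data = bytearray(num_pages * PAGE_SIZE)
        # One CRC32 per page, recorded while the page is "locked".
        self.checksums = [self._crc(p) for p in range(num_pages)]
        # locked[p] is True when the next access must verify page p first.
        self.locked = [True] * num_pages
        self.verifications = 0

    def _crc(self, page):
        start = page * PAGE_SIZE
        return zlib.crc32(bytes(self.data[start:start + PAGE_SIZE]))

    def _on_access(self, addr):
        """Stand-in for a page-fault handler: verify, then unlock the page."""
        page = addr // PAGE_SIZE
        if self.locked[page]:
            self.verifications += 1
            if self._crc(page) != self.checksums[page]:
                raise CorruptionDetected(f"page {page} failed verification")
            self.locked[page] = False

    def read(self, addr):
        self._on_access(addr)
        return self.data[addr]

    def write(self, addr, value):
        self._on_access(addr)
        self.data[addr] = value

    def relock_all(self):
        """Re-checksum every unlocked page and lock it again."""
        for page in range(self.num_pages):
            if not self.locked[page]:
                self.checksums[page] = self._crc(page)
                self.locked[page] = True

    def inject_bit_flip(self, addr, bit):
        """Simulate a soft error by flipping one bit without any checks."""
        self.data[addr] ^= 1 << bit


mem = ProtectedMemory(num_pages=4)
mem.write(10, 42)            # first touch of page 0: verified, then unlocked
mem.relock_all()             # page 0 is re-checksummed and locked
mem.inject_bit_flip(10, 3)   # silent corruption while the page is locked
try:
    mem.read(10)             # next access verifies the page first
    detected = False
except CorruptionDetected:
    detected = True
print("corruption detected:", detected)
```

In this model a page is only checked when it is touched after being locked, so untouched pages cost nothing, while a page that stays unlocked is not covered until the next `relock_all`. How often pages are relocked is one plausible tuning knob for trading overhead against coverage, although the abstract does not say which parameters LIBSDC actually exposes.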