RISE: Improving the Streaming Processors Reliability against Soft Errors in GPGPUs

  • Authors: Jingweijia Tan; Xin Fu
  • Affiliations: University of Kansas, Lawrence, KS, USA; University of Kansas, Lawrence, KS, USA
  • Venue: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques
  • Year: 2012

Abstract

With hundreds of cores integrated on a single chip, general-purpose computing on graphics processing units (GPGPUs) provides high computing power to accelerate parallel applications. However, GPGPUs are highly vulnerable to soft errors because they lack fault detection and tolerance mechanisms. The streaming processors (SPs) in particular become the reliability hot-spot in GPGPUs. This paper explores two opportunistic soft-error detection techniques that cost-effectively improve the reliability of the streaming processors. Observing that the streaming processors are not fully utilized during branch divergence and during pipeline stalls caused by long-latency operations, we propose to Recycle the streaming processors' Idle time for Soft-Error detection (RISE), obtaining good fault coverage with negligible performance degradation. RISE is composed of full-RISE and partial-RISE. Full-RISE selectively triggers redundancy for a set of warps so that it leverages the fully idle streaming processors during pipeline stall time for error detection. Partial-RISE performs redundancy for a number of threads in certain warps, using the partially idle streaming processors during branch divergence. Our experimental results show that RISE improves the SPs' soft-error reliability by 43% with negligible (e.g., 4%) performance loss.
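
The partial-RISE idea described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the warp width of 32, the one-to-one pairing of idle lanes with active lanes, the single-bit-flip fault model, and every function name below are hypothetical choices made for illustration. The model shows idle lanes in a diverged warp re-executing the work of active lanes and flagging any mismatch.

```python
from typing import Callable, Dict, List, Optional, Tuple

WARP_SIZE = 32  # assumed warp width; the abstract does not state one


def pair_idle_with_active(active_mask: List[bool]) -> Dict[int, int]:
    """Map each idle lane to one active lane whose work it will repeat.

    Hypothetical pairing policy: idle and active lanes are matched in
    index order. zip() stops at the shorter list, so some active lanes
    stay unchecked when idle lanes are scarce.
    """
    active = [i for i, a in enumerate(active_mask) if a]
    idle = [i for i, a in enumerate(active_mask) if not a]
    return dict(zip(idle, active))


def execute_with_partial_redundancy(
    op: Callable[[int], int],
    inputs: List[int],
    active_mask: List[bool],
    faulty_lane: Optional[int] = None,
) -> Tuple[Dict[int, int], List[int], float]:
    """Run `op` on the active lanes and redundantly on paired idle lanes.

    Returns (results per active lane, active lanes whose redundant copy
    disagreed, fraction of active lanes that had a redundant copy).
    """
    def run(lane: int, value: int) -> int:
        result = op(value)
        if lane == faulty_lane:
            result ^= 1  # assumed fault model: flip the lowest result bit
        return result

    # Primary execution on the lanes that took this branch path.
    results = {lane: run(lane, inputs[lane])
               for lane, a in enumerate(active_mask) if a}

    # Redundant execution on otherwise idle lanes, then comparison.
    pairing = pair_idle_with_active(active_mask)
    mismatches = [active_lane
                  for idle_lane, active_lane in pairing.items()
                  if run(idle_lane, inputs[active_lane]) != results[active_lane]]

    coverage = len(pairing) / len(results) if results else 0.0
    return results, mismatches, coverage


if __name__ == "__main__":
    inputs = list(range(WARP_SIZE))
    mask = [lane < 8 for lane in range(WARP_SIZE)]  # 8 active, 24 idle lanes
    _, bad, cov = execute_with_partial_redundancy(
        lambda x: x * x + 1, inputs, mask, faulty_lane=3)
    print("mismatching lanes:", bad, "coverage:", cov)
```

In this model, coverage depends on how many lanes are idle: when fewer lanes are idle than active, only some threads get a redundant copy, which is consistent with the abstract's wording that partial-RISE covers "a number of threads in certain warps". A mismatch signals an error but does not identify which of the two copies was corrupted.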