Selective replication: A lightweight technique for soft errors

  • Authors:
  • Xavier Vera;Jaume Abella;Javier Carretero;Antonio González

  • Affiliations:
  • Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain (all authors)

  • Venue:
  • ACM Transactions on Computer Systems (TOCS)
  • Year:
  • 2010

Abstract

Soft errors are an important challenge in contemporary microprocessors. Modern processors have caches and large memory arrays protected by parity or error detection and correction codes. However, today's failure rate is dominated by flip-flops, latches, and the increasing sensitivity of combinational logic to particle strikes. Moreover, as Chip Multi-Processors (CMPs) become ubiquitous, meeting the failures-in-time (FIT) budget for new designs is becoming a major challenge. Solutions based on replicating threads have been explored deeply; however, their high cost in performance and energy makes them unsuitable for current designs. Our studies, based on a typical configuration for a modern processor, show that focusing on the five most vulnerable structures can provide up to a 70% reduction in FIT rate. Full replication may therefore overprotect the chip by reducing the FIT rate far below budget. We propose Selective Replication, a lightweight, reconfigurable mechanism that achieves a high FIT reduction by protecting the most vulnerable instructions with minimal performance and energy impact. Low performance degradation is achieved by not requiring additional issue slots and by reissuing instructions only during the time window between when they become retirable and when they actually retire. Coverage can be reconfigured online by replicating only a subset of the instructions (the most vulnerable ones). An instruction's vulnerability is estimated from the area it occupies and the time it spends in the issue queue. By changing the vulnerability threshold, we can adjust the trade-off between coverage and performance loss. Results for an out-of-order processor configured similarly to the Intel® Core™ Micro-Architecture show that our scheme can achieve over 65% FIT reduction with less than 4% performance degradation and with small area and complexity overhead.
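
The threshold-based selection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract says only that vulnerability is estimated from area and issue-queue residency, so the product `area_bits * queue_cycles`, the `Instruction` record, the function names, and all numbers below are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    name: str
    area_bits: int     # hypothetical: bits of issue-queue state the instruction occupies
    queue_cycles: int  # hypothetical: cycles the instruction spent in the issue queue


def vulnerability(instr: Instruction) -> int:
    # Assumption: combine area and residency time as a simple product.
    # The abstract does not give the exact formula.
    return instr.area_bits * instr.queue_cycles


def select_for_replication(instrs, threshold):
    # Replicate only instructions whose estimated vulnerability meets the threshold.
    return [i for i in instrs if vulnerability(i) >= threshold]


def coverage(instrs, threshold):
    # Fraction of total estimated vulnerability covered by the replicated subset.
    total = sum(vulnerability(i) for i in instrs)
    if total == 0:
        return 0.0
    protected = sum(vulnerability(i) for i in select_for_replication(instrs, threshold))
    return protected / total


# Made-up trace: three instructions with different footprints and wait times.
trace = [
    Instruction("add", area_bits=64, queue_cycles=2),
    Instruction("load", area_bits=96, queue_cycles=30),
    Instruction("mul", area_bits=64, queue_cycles=10),
]

for threshold in (0, 500, 1000):
    chosen = [i.name for i in select_for_replication(trace, threshold)]
    print(threshold, chosen, round(coverage(trace, threshold), 3))
```

In this toy model, raising the threshold replicates fewer instructions, which lowers both the coverage and the number of reissues, mirroring the coverage versus performance trade-off the abstract describes.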