NoC-based fault-tolerant cache design in chip multiprocessors

  • Authors:
  • Abbas Banaiyanmofrad;Gustavo Girão;Nikil Dutt

  • Affiliations:
  • University of California, Irvine, CA;Federal University of Rio Grande do Sul, Porto Alegre, Brazil;University of California, Irvine, CA

  • Venue:
  • ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
  • Year:
  • 2014

Abstract

Advances in technology scaling make emerging Chip MultiProcessor (CMP) platforms increasingly susceptible to failures, which raise a range of reliability challenges. In such platforms, error-prone on-chip memories (caches) continue to dominate the chip area. At the same time, Network-on-Chip (NoC) fabrics are increasingly used to manage the scalability of these architectures. We present a novel solution for efficiently implementing fault-tolerant Last-Level Cache (LLC) designs in CMP architectures. The proposed approach leverages the interconnection network fabric to protect the LLC banks against permanent faults in an efficient and scalable way. During an LLC access to a faulty block, the network detects and corrects the faults, returning the fault-free data to the requesting core. By leveraging the NoC interconnection fabric, designers can implement any cache fault-tolerance scheme in an efficient, modular, and scalable manner for emerging multicore/manycore platforms. We propose four different policies for implementing a remapping-based fault-tolerant scheme that leverages the NoC fabric in different settings. The proposed policies enable design trade-offs between NoC traffic (packets sent through the network) and the intrinsic parallelism of these communication mechanisms, allowing designers to tune the system based on design constraints. We perform an extensive design space exploration on NoC benchmarks to demonstrate the usability and efficacy of our approach. In addition, we perform a sensitivity analysis to observe how the various policies respond to improvements in the NoC architecture. The overheads of leveraging the NoC fabric are minimal: on an 8-core, 16-cache-bank CMP we demonstrate reliable access to LLCs with additional overheads of less than 3% in area and less than 7% in power.
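
As an illustration of the general idea of remapping-based protection, the toy model below redirects accesses to a known-faulty LLC block toward a healthy spare block in another bank and counts the extra network packets that the redirection causes. It is only a sketch under stated assumptions: the names `RemapTable`, `Llc`, `mark_faulty`, and `extra_packets` are hypothetical, the one-extra-packet-per-redirect cost is an assumption, and the code does not reproduce any of the four policies of the paper, which the abstract does not describe in detail.

```python
from dataclasses import dataclass, field


@dataclass
class RemapTable:
    """Hypothetical table mapping a faulty (bank, block) to a healthy spare."""
    entries: dict = field(default_factory=dict)

    def mark_faulty(self, faulty, spare):
        # Record that accesses to `faulty` must be served by `spare`.
        self.entries[faulty] = spare

    def resolve(self, bank, block):
        # Fault-free blocks resolve to themselves.
        return self.entries.get((bank, block), (bank, block))


class Llc:
    """Toy banked LLC: each bank is a dict from block index to data."""

    def __init__(self, num_banks, remap):
        self.banks = [dict() for _ in range(num_banks)]
        self.remap = remap
        self.extra_packets = 0  # assumed cost: one extra packet per redirect

    def _target(self, bank, block):
        target = self.remap.resolve(bank, block)
        if target != (bank, block):
            self.extra_packets += 1
        return target

    def write(self, bank, block, data):
        b, k = self._target(bank, block)
        self.banks[b][k] = data

    def read(self, bank, block):
        b, k = self._target(bank, block)
        return self.banks[b].get(k)
```

With 16 banks and block 42 of bank 3 marked faulty and remapped to block 1000 of bank 7, a write followed by a read of `(3, 42)` returns the written data from bank 7 and increments `extra_packets` twice, which mirrors in a very simplified way the traffic side of the trade-off the abstract mentions.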