uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults

  • Authors:
  • Ritesh Parikh;Valeria Bertacco

  • Affiliations:
  • University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI

  • Venue:
  • Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

As silicon continues to scale, transistor reliability is becoming a major concern. At the same time, increasing transistor counts are causing a rapid shift towards large chip multi-processors (CMP) and system-on-chip (SoC) designs, comprising several cores and IPs communicating via a network-on-chip (NoC). As the sole medium of on-chip communication, a NoC should gracefully tolerate many permanent faults. We propose uDIREC, a unified framework for permanent fault diagnosis and subsequent reconfiguration in NoCs that provides graceful performance degradation with increasing number of faults. Upon in-field transistor failures, uDIREC leverages a fine-resolution diagnosis mechanism to disable faulty components very sparingly. At its core, uDIREC employs a novel routing algorithm to find reliable and deadlock-free routes that utilize the still-functional links in the NoC. uDIREC places no restriction on topology, router architecture and number and location of faults. Experimental results show that uDIREC, implemented in a 64-node NoC, drops 3x fewer nodes and provides 25% higher throughput (beyond 15 faults) when compared to other state-of-the-art fault-tolerance solutions. uDIREC's improvement over prior-art grows with more faults, making it a suitable NoC reliability solution for a wide range of fault rates.