A fault tolerant NoC architecture using quad-spare mesh topology and dynamic reconfiguration

  • Authors:
  • Yu Ren;Leibo Liu;Shouyi Yin;Jie Han;Qinghua Wu;Shaojun Wei

  • Affiliations:
  • Institute of Microelectronics and The National Lab for Information Science and Technology, Tsinghua University, Beijing 100084, China;Institute of Microelectronics and The National Lab for Information Science and Technology, Tsinghua University, Beijing 100084, China;Institute of Microelectronics and The National Lab for Information Science and Technology, Tsinghua University, Beijing 100084, China;ECE Department, University of Alberta, Edmonton, Canada T6G 2V4;Institute of Microelectronics and The National Lab for Information Science and Technology, Tsinghua University, Beijing 100084, China;Institute of Microelectronics and The National Lab for Information Science and Technology, Tsinghua University, Beijing 100084, China

  • Venue:
  • Journal of Systems Architecture: the EUROMICRO Journal
  • Year:
  • 2013

Abstract

Network-on-Chip (NoC) is widely used as a communication scheme in modern many-core systems. To guarantee reliable communication, effective fault-tolerant techniques are critical for an NoC. In this paper, a novel fault-tolerant architecture employing redundant routers is proposed to maintain the functionality of a network in the presence of failures. The architecture consists of a mesh of 2x2 router blocks with a spare router placed in the center of each block. The spare router provides a viable alternative when a router in its block fails. The proposed fault-tolerant architecture is therefore referred to as a quad-spare mesh. The quad-spare mesh can be dynamically reconfigured by changing control signals, without altering the underlying topology. This dynamic reconfiguration and its corresponding routing algorithm are demonstrated in detail. Since the topology after reconfiguration is consistent with the original error-free 2D mesh, the proposed design is transparent to operating systems and application software. Experimental results show that the proposed design achieves significant improvements in reliability over designs reported in the literature. Relative to the error-free system, a single router failure decreases throughput by only 5.19% and increases latency by 2.40%, with about 45.9% hardware redundancy.
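
The block structure described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation: the class `QuadSpareMesh`, its methods, and the use of plain XY dimension-order routing are all assumptions made for illustration, since the abstract does not specify the control signals or the routing algorithm. It only shows the idea that a failed router's logical position can be served by the spare of its 2x2 block while the logical 2D mesh, and hence any route computed on it, stays unchanged.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]


class QuadSpareMesh:
    """Toy model (hypothetical): a logical 2D mesh tiled into 2x2 blocks,
    with one spare router per block."""

    def __init__(self, width: int, height: int) -> None:
        if width % 2 or height % 2:
            raise ValueError("width and height must be even to tile into 2x2 blocks")
        self.width = width
        self.height = height
        # Block coordinate -> logical router currently served by that block's spare.
        self.spare_in_use: Dict[Coord, Coord] = {}

    def block_of(self, node: Coord) -> Coord:
        """Return the coordinate of the 2x2 block containing a logical router."""
        x, y = node
        return (x // 2, y // 2)

    def fail(self, node: Coord) -> bool:
        """Record a router failure. Returns True if the block's spare covers it.

        A second failure in the same block is reported as uncovered (False);
        this toy model does not handle that case any further.
        """
        block = self.block_of(node)
        if block in self.spare_in_use:
            return self.spare_in_use[block] == node
        self.spare_in_use[block] = node
        return True

    def physical(self, node: Coord) -> Tuple[str, int, int]:
        """Map a logical router to the physical unit that currently serves it."""
        block = self.block_of(node)
        if self.spare_in_use.get(block) == node:
            return ("spare", block[0], block[1])
        return ("router", node[0], node[1])

    def xy_route(self, src: Coord, dst: Coord) -> List[Coord]:
        """Plain XY routing on the logical mesh (an assumed stand-in,
        not the routing algorithm from the paper)."""
        x, y = src
        path = [src]
        while x != dst[0]:
            x += 1 if dst[0] > x else -1
            path.append((x, y))
        while y != dst[1]:
            y += 1 if dst[1] > y else -1
            path.append((x, y))
        return path
```

In this model, calling `fail((1, 0))` on a 4x4 mesh leaves `xy_route((0, 0), (3, 2))` unchanged, while `physical((1, 0))` switches from the original router to the spare of block (0, 0). That mirrors, in a very simplified way, why the reconfigured network can remain transparent to software that only sees logical mesh coordinates.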