A novel NoC-based design for fault-tolerance of last-level caches in CMPs
Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
NoC-based fault-tolerant cache design in chip multiprocessors
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Large-scale chip multiprocessors (CMPs) generally employ a network-on-chip (NoC) to connect the last-level cache (LLC), which is typically organized as distributed NUCA (non-uniform cache access) arrays for scalability and efficiency. At the same time, aggressive technology scaling induces severe reliability problems, causing on-chip components (e.g., cores, cache banks, routers) to fail due to manufacturing defects or online hardware faults. A degradable CMP should be able to work around defects by disabling faulty components. In a static NUCA architecture, however, disabling the cache banks attached to a computing node makes certain physical address ranges inaccessible. Prior approaches, such as the set reduction introduced in the Intel Xeon processor 7100 series, turn off cache banks by masking certain set bits in the physical address, which wastes a great deal of cache capacity. In this paper, we propose to tackle the problem at a finer granularity to limit the capacity loss in the NUCA cache. Cache accesses to isolated nodes are redirected by a utility-driven address remapping scheme that reduces data block conflicts in the fault-tolerant shared LLC. We evaluate our technique using the GEMS simulator. Experimental results show that address remapping achieves significant improvement over the conventional cache sizing scheme.
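To make the idea of redirecting accesses away from disabled banks concrete, the following is a minimal sketch and not the scheme evaluated in the paper. It assumes a 16-bank static NUCA with 64-byte blocks, where the home bank is selected by the address bits just above the block offset, and it uses a hypothetical per-bank `utility` estimate (for example, observed demand) to assign each faulty bank's address range to the least-loaded healthy bank. All names and parameters (`BLOCK_BITS`, `NUM_BANKS`, `build_remap_table`, `resolve_bank`) are illustrative assumptions.

```python
BLOCK_BITS = 6    # assumed 64-byte cache blocks
NUM_BANKS = 16    # assumed 4x4 mesh, one LLC bank per node


def home_bank(addr: int) -> int:
    """Static NUCA mapping: bank index comes from the bits above the block offset."""
    return (addr >> BLOCK_BITS) % NUM_BANKS


def build_remap_table(faulty: set, utility: list) -> dict:
    """Assign each faulty bank to a healthy bank, balancing estimated load.

    utility[b] is a hypothetical demand estimate for bank b; the greedy
    policy below is an assumption, not the paper's algorithm.
    """
    healthy = [b for b in range(NUM_BANKS) if b not in faulty]
    if not healthy:
        raise ValueError("no healthy bank left to absorb remapped accesses")
    load = {b: utility[b] for b in healthy}
    table = {}
    # Place the most heavily used faulty banks first.
    for f in sorted(faulty, key=lambda b: -utility[b]):
        target = min(healthy, key=lambda b: (load[b], b))
        table[f] = target
        load[target] += utility[f]  # the target now also serves f's traffic
    return table


def resolve_bank(addr: int, table: dict) -> int:
    """Return the bank that actually serves addr after remapping."""
    bank = home_bank(addr)
    return table.get(bank, bank)


# Usage: banks 3 and 7 are disabled; banks 0 and 1 are lightly used.
utility = [10] * NUM_BANKS
utility[0], utility[1], utility[3], utility[7] = 1, 2, 5, 8
table = build_remap_table({3, 7}, utility)
served_by = resolve_bank(3 << BLOCK_BITS, table)
```

Unlike masking a set bit, which removes a fixed fraction of the cache regardless of where the fault is, a table of this kind only gives up the capacity of the banks that are actually disabled, at the cost of extra conflicts in the banks that absorb the redirected blocks.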