An adaptive low-overhead mechanism for dependable general-purpose many-core processors

Authors:
Wentao Jia;Rui Li;Chunyan Zhang
Affiliations:
School of Computer, National University of Defense Technology, China;School of Computer, National University of Defense Technology, China;School of Computer, National University of Defense Technology, China
Venue:
ICT-EurAsia'13 Proceedings of the 2013 international conference on Information and Communication Technology
Year:
2013

Abstract

Future many-core processors may contain more than 1000 cores on a single die. However, continued scaling of silicon fabrication technology makes chips of this scale more vulnerable to errors. A low-overhead, adaptive fault-tolerance mechanism is therefore desirable for general-purpose many-core processors. We propose high-level adaptive redundancy (HLAR), which has several distinctive properties. First, the technique employs selective redundancy based on application assistance and dynamic core scheduling. Second, the method incurs minimal overhead when the mechanism is disabled. Third, it expands the replication sphere to include local memory, which raises the replication level and simplifies the redundancy mechanism. Finally, it reduces bandwidth through various compression methods, thus effectively balancing reliability, performance, and power. Experimental results show a remarkably low overhead while covering 99.999% of errors with only 0.25% more network-on-chip traffic.
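
The abstract does not describe how HLAR is implemented, so the following is only a minimal sketch of the general idea of selective redundancy with compressed comparison, not the authors' mechanism. It assumes the application marks a task as critical, that a critical task is executed twice (standing in for two cores), and that each result is compressed into a CRC-32 value before comparison. The names `fingerprint`, `run_with_selective_redundancy`, and `inject_fault`, and the choice of CRC-32, are all hypothetical.

```python
import zlib
from typing import Callable, List, Optional, Tuple

Task = Callable[[List[int]], List[int]]


def fingerprint(values: List[int]) -> int:
    """Compress a result vector into a 32-bit CRC (assumed compression scheme)."""
    data = b"".join(v.to_bytes(8, "little", signed=True) for v in values)
    return zlib.crc32(data)


def run_with_selective_redundancy(
    task: Task,
    inputs: List[int],
    critical: bool,
    inject_fault: Optional[int] = None,
) -> Tuple[List[int], bool]:
    """Run `task`; duplicate it only when the application marks it critical.

    Returns (primary_result, mismatch_detected). `inject_fault` is a
    hypothetical test hook: the index of a shadow result whose lowest bit
    is flipped to imitate a soft error.
    """
    primary = task(inputs)
    if not critical:
        # Redundancy disabled: no second execution and no comparison.
        return primary, False

    # Second execution stands in for a redundant core.
    shadow = list(task(inputs))
    if inject_fault is not None:
        shadow[inject_fault] ^= 1  # simulate a single-bit upset

    # Only the two 32-bit fingerprints are compared, not the full results.
    mismatch = fingerprint(primary) != fingerprint(shadow)
    return primary, mismatch


def square_all(xs: List[int]) -> List[int]:
    return [x * x for x in xs]


clean_result, clean_mismatch = run_with_selective_redundancy(
    square_all, [1, 2, 3], critical=True
)
_, faulty_mismatch = run_with_selective_redundancy(
    square_all, [1, 2, 3], critical=True, inject_fault=1
)
_, skipped_mismatch = run_with_selective_redundancy(
    square_all, [1, 2, 3], critical=False, inject_fault=1
)
```

In this sketch a non-critical task runs once and is never compared, which mirrors the stated goal of minimal overhead when redundancy is disabled. Comparing fixed-size fingerprints instead of full results illustrates how compression can keep comparison traffic small.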