Node-covering, Error-correcting Codes and Multiprocessors with Very High Average Fault Tolerance

Authors:
Shantanu Dutt; Nihar R. Mahapatra
Affiliations:
Univ. of Illinois at Chicago, Chicago; State Univ. of New York at Buffalo, Buffalo
Venue:
IEEE Transactions on Computers
Year:
1997

Abstract

Structural fault tolerance (SFT) is the ability of a multiprocessor to reconfigure around faulty processors or links in order to preserve its original processor interconnection structure. In this paper, we focus on the design of SFT multiprocessors that have low switch and link overheads, but can tolerate a very large number of processor faults on the average. Most previous work has concentrated on deterministic k-fault-tolerant (k-FT) designs in which exactly k spare processors and some spare switches and links are added to construct multiprocessors that can tolerate any k processor faults. However, after k faults are reconfigured around, many of the extra links and switches can remain unutilized. It is possible within the basic node-covering framework, which was introduced by Dutt and Hayes as an efficient k-FT design method, to design FT multiprocessors that have the same number of switches and links as, say, a two-FT deterministic design, but have s spare processors, where $s \gg 2$, so that, on the average, $k = \Theta(s)$ ($k \le s$) processor failures can be reconfigured around. Such designs utilize the spare link and switch capacity very efficiently, and are called probabilistic FT designs. An elegant and powerful method to construct covering graphs (CGs), which are key to obtaining the probabilistic FT designs, is to use linear error-correcting codes (ECCs). We show how to construct probabilistic designs with very high average fault tolerance but low wiring and switch overhead using ECCs like the 2D-parity, full-two, 3D-parity, and full-three codes. This design methodology is applicable to any multiprocessor interconnection topology and the resulting FT designs have the same node degree as the non-FT target topology. We also analyze the deterministic fault tolerance for these designs and develop efficient layout strategies for them.
Finally, we compare the proposed probabilistic designs to some of the best deterministic and probabilistic designs proposed in the past, and show that our designs can meet a given mean-time-to-failure (MTTF) specification at much lower hardware costs (switch complexity, redundant wiring area, and spare-processor overhead) than previous designs. Further, for a given number of spare processors, our designs have close-to-optimal reconfigurabilities that are much better than those of previous probabilistic designs.
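
To give a feel for why a code such as 2D-parity can yield many spares at the link cost of a two-FT design, the following is a minimal Python sketch. It is an illustration, not the reconfiguration algorithm of the paper. It assumes that primary processors sit at the data positions of a rows x cols 2D-parity layout, that there is one spare per row check and one per column check, and that a faulty primary is replaced directly by a spare tied to one of its two checks. The function names `parity_2d_checks` and `can_assign_spares` are hypothetical, and the actual node-covering scheme (which reconfigures along covering sequences) may tolerate different fault sets.

```python
from itertools import product


def parity_2d_checks(rows, cols):
    """Map each data position (r, c) to the two parity checks that cover it.

    ("row", r) is the parity check of row r and ("col", c) the parity check
    of column c, as in the standard 2D-parity code layout.
    """
    return {(r, c): [("row", r), ("col", c)]
            for r, c in product(range(rows), range(cols))}


def can_assign_spares(faulty, checks):
    """Return True if every faulty position can get its own distinct check.

    Assumption (not from the source): each check stands for one spare, and a
    faulty primary is replaced directly by one of its spares. The search is
    ordinary augmenting-path bipartite matching.
    """
    owner = {}  # check -> faulty position currently assigned to it

    def try_place(node, seen):
        for chk in checks[node]:
            if chk in seen:
                continue
            seen.add(chk)
            # Take a free check, or move its current owner to another check.
            if chk not in owner or try_place(owner[chk], seen):
                owner[chk] = node
                return True
        return False

    return all(try_place(node, set()) for node in faulty)


if __name__ == "__main__":
    checks = parity_2d_checks(4, 4)          # 16 primaries, 8 spares
    square = [(0, 0), (0, 1), (1, 0), (1, 1)]
    block = [(r, c) for r in range(2) for c in range(3)]
    print(can_assign_spares(square, checks))
    print(can_assign_spares(block, checks))
```

Under these assumptions, each primary touches only two checks, yet a 4 x 4 array carries eight spares. Four faults forming a 2 x 2 square can still be assigned distinct spares, while six faults packed into a 2 x 3 block cannot, because they share only five checks. This mirrors the distinction the abstract draws between deterministic fault tolerance (what is guaranteed for any fault set) and the much larger number of faults tolerated on the average.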