Fault Tolerance Techniques for the Merrimac Streaming Supercomputer

Authors:
Mattan Erez;Nuwan Jayasena;Timothy J. Knight;William J. Dally
Affiliations:
Stanford University;Stanford University;Stanford University;Stanford University
Venue:
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Year:
2005

Abstract

As device dimensions shrink, higher transistor counts become available, while soft errors, even in logic, become a major concern. A new class of architectures, such as Merrimac and the IBM Cell, takes advantage of the higher transistor count by exposing control, communication, and a large number of functional units at the architectural level, thus achieving high performance and efficiency. This paper explores soft-error fault tolerance in the context of these compute-intensive architectures, which differ significantly from their control-intensive CPU counterparts. The main goal of the proposed schemes for Merrimac is to conserve the critical and costly off-chip bandwidth and on-chip storage resources while maintaining high peak and sustained performance. We achieve this by allowing for reconfigurability and relying on programmer input. The processor is run either at full peak performance with software fault-tolerance methods, or at reduced performance with hardware redundancy. We present several methods, their analysis, and detailed case studies.
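The abstract does not say which software fault-tolerance methods are used. As an illustration only, the sketch below shows one well-known software technique, checksum-based algorithm-based fault tolerance (ABFT) for matrix multiplication. It is not taken from the paper, and the function names and the `fault` injection parameter are hypothetical. Integer inputs are assumed so that the checksum comparison is exact; floating-point data would need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b one extra element holding that row's sum."""
    return [row + [sum(row)] for row in b]


def checked_matmul(a, b, fault=None):
    """Multiply a by b and verify the result with checksums.

    fault is a hypothetical (row, col, delta) tuple that corrupts one
    element of the computed product, standing in for a soft error.
    Returns (product, ok), where ok is False if a mismatch is detected.
    """
    # The product of the checksum-augmented inputs carries its own
    # row and column checksums in its last column and last row.
    full = matmul(with_column_checksum(a), with_row_checksum(b))

    if fault is not None:
        r, c, delta = fault
        full[r][c] += delta  # simulated soft error

    # Every row and every column must sum to its checksum entry.
    rows_ok = all(sum(row[:-1]) == row[-1] for row in full)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*full))

    product = [row[:-1] for row in full[:-1]]
    return product, rows_ok and cols_ok


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(checked_matmul(a, b))                   # fault-free run
    print(checked_matmul(a, b, fault=(0, 1, 1)))  # corrupted run
```

In this sketch the check costs one extra row and one extra column of arithmetic and needs no duplicated hardware, which is the general trade-off the abstract describes: full peak performance with software checking versus reduced performance with hardware redundancy.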