The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. The error coverage of the system-level error detection schemes has been studied extensively in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that isolate faulty processors and replace them with spare processors.
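The abstract does not spell out the encodings used. As a rough illustration of what an algorithm-based check for matrix multiplication can look like, the sketch below uses the well-known row/column checksum encoding, which is an assumption here and not necessarily the authors' scheme. The function names and the absolute tolerance `tol` are hypothetical; the tolerance stands in for the finite-precision concern the abstract raises, since rounding means the checksums of a floating-point product rarely match exactly.

```python
# Minimal single-process sketch of checksum-based error detection for
# matrix multiplication. Assumption: a row/column checksum encoding;
# the abstract does not state which encoding the authors used.

def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def check_product(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols) whose checksums disagree by more than tol.

    c_full is the product of a column-checksum matrix and a row-checksum
    matrix, so its last row and last column hold the expected sums.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    return bad_rows, bad_cols

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
print(check_product(c_full))   # no inconsistencies expected

c_full[1][0] += 1.0            # simulate a faulty result element
print(check_product(c_full))   # the failing row and column locate the error
```

A single corrupted element shows up as one inconsistent row and one inconsistent column, which is what lets a check of this kind locate an error as well as detect it. Choosing `tol` is the delicate part: too tight and rounding noise raises false alarms, too loose and small errors go undetected.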