Fault-tolerant algorithms for multiple processor systems
Rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple copies of low-cost processors to provide a large amount of computational capability at a small cost. In addition to high performance, high reliability is important to ensure that the results of long computations are valid. This paper proposes a novel system-level method of achieving high reliability, called algorithm-based fault tolerance. The technique encodes data at a high level, and algorithms are designed to operate on encoded data and produce encoded output data. The computation tasks within an algorithm are appropriately distributed among multiple computation units for fault tolerance. The technique is applied to matrix computations, which form the heart of many computation-intensive tasks. Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems. The proposed method can detect and correct any failure within a single processor in a multiple processor system. The number of processors needed solely to detect errors in matrix multiplication is also studied.
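The idea of operating on encoded data and producing encoded output can be illustrated with a checksum encoding for matrix multiplication. The following is a minimal sketch under stated assumptions, not the paper's implementation: it runs on a single machine, uses exact integer arithmetic, and handles only one corrupted data element. The function names (column_checksum, row_checksum, matmul, locate_error, correct) are hypothetical and chosen for illustration. A matrix A is extended with a row of column sums, B with a column of row sums, and their product then carries both checksums, so a single wrong element shows up as exactly one inconsistent row and one inconsistent column.

```python
def column_checksum(a):
    """Return a copy of `a` with an extra row holding the sum of each column."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Return a copy of `b` with an extra column holding the sum of each row."""
    return [row + [sum(row)] for row in b]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def locate_error(c):
    """Find a single inconsistent data element in a full checksum matrix.

    Returns None if all checksums agree, or (i, j) if exactly one data row
    and one data column disagree with their checksums. Any other pattern
    (including an error in a checksum element itself) raises ValueError;
    that limitation is an assumption of this sketch.
    """
    n, m = len(c) - 1, len(c[0]) - 1  # size of the data part
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[k][j] for k in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single data element")


def correct(c):
    """Repair a single faulty data element in place using its row checksum."""
    loc = locate_error(c)
    if loc is not None:
        i, j = loc
        m = len(c[0]) - 1
        c[i][j] = c[i][m] - sum(c[i][k] for k in range(m) if k != j)
    return c
```

Calling matmul(column_checksum(a), row_checksum(b)) yields the product of a and b bordered by its own row and column sums; after overwriting one data element, locate_error reports its position and correct restores it. With floating-point data, the exact equality tests would need to be replaced by a tolerance to account for roundoff.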