Distributed-memory systems can incorporate thousands of processors at a reasonable cost. As the number of processors in a system grows, however, fault detection and fault tolerance become critical issues. Errors can be detected by replicating a computation on more than one processor and comparing the results these processors produce. Because of data dependencies, typically not all processors in a multiprocessor system are busy at all times during the execution of a program. Processor schedules therefore contain idle time slots, and the goal of this work is to exploit these slots to schedule duplicated computation for the purpose of fault detection. We propose a compiler-assisted approach to fault detection in regular loops on distributed-memory systems. The approach achieves fault detection by duplicating the execution of statement instances: after the data dependencies of a regular loop are carefully analyzed, selected instances of loop statements are duplicated in a way that ensures the desired fault coverage. We first present duplication strategies for fault detection and show that these strategies use idle processor time for executing replicated statements whenever possible. Next, we present loop transformations that implement these fault-detection strategies, and we develop a general framework for selecting appropriate loop transformations. Experiments performed on the CRAY-T3D show that the overhead of adding the fault-detection capability is usually less than 25%, and less than 10% when communication overhead is reduced by grouping messages.
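The core idea of duplicating statement instances and comparing their results can be illustrated with a minimal sketch. It is not the compiler transformation described in the abstract: the loop body `a[i] = a[i-1] + i`, the function name `run_with_duplication`, and the `fault_at` fault-injection parameter are all assumptions made for illustration, and both copies run sequentially in one process rather than on separate processors.

```python
def run_with_duplication(a, fault_at=None):
    """Run the loop a[i] = a[i-1] + i, executing every statement instance twice.

    Returns the list of iteration indices at which the two copies disagreed.
    `fault_at` is a hypothetical knob that corrupts the primary copy at one
    iteration so that detection can be observed.
    """
    mismatches = []
    for i in range(1, len(a)):
        # Original statement instance (conceptually on the owning processor).
        primary = a[i - 1] + i
        if i == fault_at:
            primary += 1  # simulated transient fault in the primary copy

        # Duplicated instance (conceptually placed in an idle slot of
        # another processor); it reads the same input operand.
        duplicate = a[i - 1] + i

        # Comparing the two results is what detects the error.
        if primary != duplicate:
            mismatches.append(i)

        a[i] = primary
    return mismatches


if __name__ == "__main__":
    print(run_with_duplication([0] * 6))              # fault-free run
    print(run_with_duplication([0] * 6, fault_at=3))  # fault injected at i = 3
```

In this sketch the fault-free run reports no mismatches, while the run with `fault_at=3` reports iteration 3 only. Later iterations read the already-corrupted value in both copies and therefore agree, which suggests why the choice of which instances to duplicate, and where to place them, matters for fault coverage.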