Loop Transformations for Fault Detection in Regular Loops on Massively Parallel Systems

Authors:
Chun Gong; Rami Melhem; Rajiv Gupta
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1996

Abstract

Distributed-memory systems can incorporate thousands of processors at a reasonable cost. However, as the number of processors in a system grows, fault detection and fault tolerance become critical issues. Errors can be detected by replicating a computation on more than one processor and comparing the results. Because of data dependencies, not all processors in a multiprocessor system are typically busy at all times during program execution. Processor schedules therefore contain idle time slots, and the goal of this work is to exploit these slots to schedule duplicated computation for fault detection. We propose a compiler-assisted approach to fault detection in regular loops on distributed-memory systems. This approach achieves fault detection by duplicating the execution of statement instances. After careful analysis of the data dependencies of a regular loop, selected instances of loop statements are duplicated in a way that ensures the desired fault coverage. We first present duplication strategies for fault detection and show that these strategies use idle processor time to execute replicated statements whenever possible. Next, we present loop transformations that implement these fault-detection strategies. We also develop a general framework for selecting appropriate loop transformations. Experiments performed on the CRAY-T3D show that the overhead of adding the fault-detection capability is usually less than 25%, and less than 10% when communication overhead is reduced by grouping messages.
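
The idea of running duplicated statement instances in idle processor slots can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the paper's duplication strategies or loop transformations: the recurrence a[i][j] = a[i-1][j] + a[i][j-1] + 1, the mapping of row i to processor i, the greedy slot search, and the "faulty processor adds 1" fault model are all hypothetical choices made for illustration.

```python
def wavefront_schedule(n, m):
    """Assumed original schedule: row i runs on processor i,
    and iteration (i, j) executes at time step i + j."""
    return {(i, j): (i, i + j) for i in range(n) for j in range(m)}


def place_duplicates(n, m):
    """Greedily give each iteration one duplicate on a different processor,
    in the earliest free slot that is not before the original's time step."""
    if n < 2:
        raise ValueError("duplication needs at least two processors")
    primary = wavefront_schedule(n, m)
    busy = set(primary.values())          # occupied (processor, time) slots
    duplicate = {}
    for it, (p, t) in sorted(primary.items(), key=lambda kv: kv[1][1]):
        s = t
        while True:
            q = next((q for q in range(n)
                      if q != p and (q, s) not in busy), None)
            if q is not None:
                break
            s += 1                        # no idle slot at this step, try later
        busy.add((q, s))
        duplicate[it] = (q, s)
    return primary, duplicate


def run_with_check(n, m, faulty=None):
    """Execute every iteration twice and compare the two results.
    Hypothetical fault model: processor `faulty` always returns value + 1."""
    primary, duplicate = place_duplicates(n, m)
    a, mismatches = {}, []
    # Visiting iterations by time step respects both loop-carried dependencies.
    for it in sorted(primary, key=lambda k: primary[k][1]):
        i, j = it
        correct = a.get((i - 1, j), 0) + a.get((i, j - 1), 0) + 1

        def on(proc):
            return correct + 1 if proc == faulty else correct

        v1, v2 = on(primary[it][0]), on(duplicate[it][0])
        if v1 != v2:
            mismatches.append(it)         # fault detected by comparison
        a[it] = v1
    return a, mismatches


def makespan(schedule):
    """Number of time steps used by a schedule."""
    return max(t for _, t in schedule.values()) + 1
```

Comparing `makespan(primary)` with the makespan of the combined schedule gives a rough sense of how many duplicates fit into idle slots and how many extend the schedule in this toy model. It says nothing about the overheads measured in the paper.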