Automatic Compilation of Loops to Exploit Operator Parallelism on Configurable Arithmetic Logic Units

Authors:
Narasimhan Ramasubramanian; Ram Subramanian; Santosh Pande
Affiliations:
Microsoft Corp., Redmond, WA; Xilinx Inc., San Jose, CA; Georgia Institute of Technology, Atlanta
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2002

Abstract

Configurable Arithmetic Logic Units (ALUs) allow the underlying hardware to be adapted to the varying amount of parallelism in a computation. Identifying the optimal parallel configurations at different steps in a program (a configuration is a given hardware implementation of different operators along with their multiplicities) is a very complex problem, but solving it allows the power of these ALUs to be used to the fullest. This paper develops an automatic compilation framework for configuration analysis that exploits operator parallelism within loop nests, with the goal of minimizing costly reconfiguration overheads. The framework first applies operator and loop transformations to expose more opportunities for configuration reuse. It then uses a two-pass solution. The first pass analyzes the program dependence graph (PDG) of a loop nest and attempts to generate either maximal cutsets (a cutset is a group of statements that execute under a given configuration) or maximally parallel configurations. The second pass analyzes the trade-offs between the costs and benefits of reconfigurations across different cutsets and attempts to eliminate reconfiguration overheads by merging cutsets. The methodology is implemented in the SUIF compilation system and tested on loops extracted from the Perfect benchmarks and the Livermore kernels. Good speedups are obtained, showing the merit of the proposed method. The method also scales well with loop size and with the amount of space available on FPGAs for configurable logic.
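
The two-pass idea in the abstract can be illustrated with a small toy model. The sketch below is not the paper's algorithm: it ignores the PDG and simply assumes the statements are already listed in a legal execution order, and the area costs, area budget, reconfiguration cost, and all function names are hypothetical values chosen for illustration. The first pass gives each statement its fastest configuration that fits the budget, and the second pass merges adjacent cutsets whenever the slowdown from sharing one configuration is smaller than the reconfiguration cost it avoids.

```python
from itertools import product
from math import ceil

# Hypothetical cost model (assumed, not taken from the paper).
AREA = {"add": 1, "mul": 4}   # area units per operator instance
BUDGET = 12                   # total configurable area available

def area(config):
    return sum(AREA[op] * n for op, n in config.items())

def exec_time(stmt, config):
    """Cycles for one statement: each operator type is served in rounds."""
    return max(ceil(n / config[op]) for op, n in stmt.items())

def group_time(stmts, config):
    return sum(exec_time(s, config) for s in stmts)

def best_config(stmts):
    """Exhaustively pick the fastest configuration within BUDGET.

    Ties are broken by smaller area. Returns None if nothing fits.
    """
    ops = sorted({op for s in stmts for op in s})
    limits = [max(s.get(op, 0) for s in stmts) for op in ops]
    best = None
    for counts in product(*(range(1, m + 1) for m in limits)):
        config = dict(zip(ops, counts))
        if area(config) > BUDGET:
            continue
        key = (group_time(stmts, config), area(config))
        if best is None or key < best[0]:
            best = (key, config)
    return None if best is None else best[1]

def first_pass(stmts):
    """Give each statement its fastest configuration; group equal neighbours."""
    cutsets = []
    for s in stmts:
        cfg = best_config([s])
        if cfg is None:
            raise ValueError("statement does not fit in the area budget")
        if cutsets and cutsets[-1][1] == cfg:
            cutsets[-1][0].append(s)
        else:
            cutsets.append(([s], cfg))
    return cutsets

def second_pass(cutsets, reconfig_cost):
    """Merge adjacent cutsets while the slowdown is below reconfig_cost."""
    cutsets = list(cutsets)
    merged = True
    while merged:
        merged = False
        for i in range(len(cutsets) - 1):
            (a, ca), (b, cb) = cutsets[i], cutsets[i + 1]
            shared = best_config(a + b)
            if shared is None:
                continue
            slowdown = (group_time(a + b, shared)
                        - group_time(a, ca) - group_time(b, cb))
            if slowdown < reconfig_cost:
                cutsets[i:i + 2] = [(a + b, shared)]
                merged = True
                break
    return cutsets

def total_cost(cutsets, reconfig_cost):
    """Execution cycles plus one reconfiguration between consecutive cutsets."""
    run = sum(group_time(stmts, cfg) for stmts, cfg in cutsets)
    return run + reconfig_cost * (len(cutsets) - 1)

# Each statement is described by how many operations of each type it needs.
stmts = [{"add": 4}, {"add": 4, "mul": 2}, {"mul": 3}]
initial = first_pass(stmts)
expensive = second_pass(initial, reconfig_cost=3)
cheap = second_pass(initial, reconfig_cost=1)
```

In this toy run, a high reconfiguration cost collapses the three initial cutsets into a single shared configuration, while a low cost keeps the last statement in its own cutset. The exhaustive search in `best_config` is only practical for tiny examples; it was chosen here for clarity over a heuristic.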