Parallelization Approaches for Hardware Accelerators --- Loop Unrolling Versus Loop Partitioning

Authors:
Frank Hannig;Hritam Dutta;Jürgen Teich
Affiliations:
Hardware/Software Co-Design, Department of Computer Science, University of Erlangen-Nuremberg, Germany;Hardware/Software Co-Design, Department of Computer Science, University of Erlangen-Nuremberg, Germany;Hardware/Software Co-Design, Department of Computer Science, University of Erlangen-Nuremberg, Germany
Venue:
ARCS '09 Proceedings of the 22nd International Conference on Architecture of Computing Systems
Year:
2009

Citing 10
Cited 2

Unimodular transformations of non-perfectly nested loops

Parallel Computing
Advanced compiler design and implementation

Advanced compiler design and implementation
High Performance Compilers for Parallel Computing

High Performance Compilers for Parallel Computing
Loop Parallelization in the Polytope Model

CONCUR '93 Proceedings of the 4th International Conference on Concurrency Theory
SPARK: A High-Level Synthesis Framework For Applying Parallelizing Compiler Transformations

VLSID '03 Proceedings of the 16th International Conference on VLSI Design
Resource Constrained and Speculative Scheduling of an Algorithm Class with Run-Time Dependent Conditionals

ASAP '04 Proceedings of the Application-Specific Systems, Architectures and Processors, 15th IEEE International Conference
Expression Synthesis in Process Networks generated by LAURA

ASAP '05 Proceedings of the 2005 IEEE International Conference on Application-Specific Systems, Architecture Processors
Hierarchical Partitioning for Piecewise Linear Algorithms

PARELEC '06 Proceedings of the international symposium on Parallel Computing in Electrical Engineering
The impact of loop unrolling on controller delay in high level synthesis

Proceedings of the conference on Design, automation and test in Europe
PARO: Synthesis of Hardware Accelerators for Multi-dimensional Dataflow-Intensive Applications

ARC '08 Proceedings of the 4th international workshop on Reconfigurable Computing: Architectures, Tools and Applications

Architecture exploration for efficient data transfer and storage in data-parallel applications

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
The benefits of using variable-length pipelined operations in high-level synthesis

ACM Transactions on Embedded Computing Systems (TECS)

Abstract

State-of-the-art behavioral synthesis tools offer few high-level transformations for achieving highly parallelized implementations. If they offer any, they apply loop unrolling to obtain higher throughput. In this paper, we employ the PARO behavioral synthesis tool, which has the unique ability to perform both loop unrolling and loop partitioning. Loop unrolling replicates the loop kernel and exposes the parallelism for hardware implementation, whereas partitioning tiles the loop program onto a regular array of tightly coupled processing elements. Using the same design tool for both variants enables, for the first time, a quantitative evaluation of the two approaches for reconfigurable architectures with the help of computationally intensive algorithms selected from different benchmarks. Superlinear speedups in terms of throughput are achieved for the processor array approach. In addition, area and power costs are reduced.
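
The difference between the two transformations named in the abstract can be illustrated with a small software model. The sketch below is not taken from the paper and does not reflect how PARO works internally; the FIR filter kernel, the function names, and the parameters `factor` and `num_pes` are assumptions chosen only for illustration. It shows unrolling as replication of the loop kernel and partitioning as tiling of the iteration space onto a chain of processing elements that pass partial results to their neighbours.

```python
def fir_reference(x, coeffs):
    """Plain loop nest: y[i] = sum over j of coeffs[j] * x[i - j]."""
    n, taps = len(x), len(coeffs)
    y = [0] * n
    for i in range(n):
        for j in range(taps):
            if i - j >= 0:
                y[i] += coeffs[j] * x[i - j]
    return y


def fir_unrolled(x, coeffs, factor=2):
    """Unroll the outer loop by `factor` (hypothetical parameter).

    The kernel body is replicated `factor` times per outer step; in
    hardware these copies could be evaluated side by side.
    """
    n, taps = len(x), len(coeffs)
    y = [0] * n
    for i0 in range(0, n, factor):
        for u in range(factor):          # replicated kernel copies
            i = i0 + u
            if i < n:                    # guard for a partial last step
                for j in range(taps):
                    if i - j >= 0:
                        y[i] += coeffs[j] * x[i - j]
    return y


def fir_partitioned(x, coeffs, num_pes=2):
    """Tile the j-dimension onto `num_pes` processing elements (assumed model).

    Each PE owns one tile of the coefficients and hands its partial sum
    to the next PE, mimicking a tightly coupled processor array.
    """
    n, taps = len(x), len(coeffs)
    tile = -(-taps // num_pes)           # ceiling division: taps per PE
    y = [0] * n
    for i in range(n):
        partial = 0                      # value forwarded between neighbouring PEs
        for pe in range(num_pes):
            for j in range(pe * tile, min((pe + 1) * tile, taps)):
                if i - j >= 0:
                    partial += coeffs[j] * x[i - j]
        y[i] = partial
    return y
```

All three functions compute the same result; only the structure of the iteration space changes. This sequential model says nothing about throughput, area, or power, which depend on the hardware that a synthesis tool generates.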