Unimodular transformations of non-perfectly nested loops
Parallel Computing
Advanced compiler design and implementation
Advanced compiler design and implementation
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
Loop Parallelization in the Polytope Model
CONCUR '93 Proceedings of the 4th International Conference on Concurrency Theory
SPARK: A High-Lev l Synthesis Framework For Applying Parallelizing Compiler Transformations
VLSID '03 Proceedings of the 16th International Conference on VLSI Design
ASAP '04 Proceedings of the Application-Specific Systems, Architectures and Processors, 15th IEEE International Conference
Expression Synthesis in Process Networks generated by LAURA
ASAP '05 Proceedings of the 2005 IEEE International Conference on Application-Specific Systems, Architecture Processors
Hierarchical Partitioning for Piecewise Linear Algorithms
PARELEC '06 Proceedings of the international symposium on Parallel Computing in Electrical Engineering
The impact of loop unrolling on controller delay in high level synthesis
Proceedings of the conference on Design, automation and test in Europe
PARO: Synthesis of Hardware Accelerators for Multi-dimensional Dataflow-Intensive Applications
ARC '08 Proceedings of the 4th international workshop on Reconfigurable Computing: Architectures, Tools and Applications
Architecture exploration for efficient data transfer and storage in data-parallel applications
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
The benefits of using variable-length pipelined operations in high-level synthesis
ACM Transactions on Embedded Computing Systems (TECS)
Hi-index | 0.00 |
State-of-the-art behavioral synthesis tools barely have high-level transformations in order to achieve highly parallelized implementations. If any, they apply loop unrolling to obtain a higher throughput. In this paper, we employ the PARO behavioral synthesis tool which has the unique ability to perform both loop unrolling or loop partitioning. Loop unrolling replicates the loop kernel and exposes the parallelism for hardware implementation, whereas partitioning tiles the loop program onto a regular array consisting of tightly coupled processing elements. The usage of the same design tool for both the variants enables for the first time, a quantitative evaluation of the two approaches for reconfigurable architectures with help of computationally intensive algorithms selected from different benchmarks. Superlinear speedups in terms of throughput are accomplished for the processor array approach. In addition, area and power cost are reduced.