Improving performance of nested loops on reconfigurable array processors

Authors:
Yongjoo Kim; Jongeun Lee; Toan X. Mai; Yunheung Paek
Affiliations:
Seoul National University, Korea; UNIST, South Korea; UNIST, South Korea; Seoul National University, Korea
Venue:
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Year:
2012


Abstract

Pipelining algorithms are typically concerned with improving only the steady-state performance, or kernel time. Pipeline setup happens only once per loop, so its time is usually negligible compared to the kernel time. However, when a Coarse-Grained Reconfigurable Architecture (CGRA) is used as a coprocessor to a main processor, pipeline setup can take much longer because of the communication delay between the two processors, and it can become significant if it is repeated in an outer loop of a loop nest. In this paper we evaluate the overhead of such non-kernel execution time when mapping nested loops onto CGRAs, and propose a novel architecture-compiler cooperative scheme that reduces the overhead while also minimizing the number of extra configurations required. Our experimental results, using loops from multimedia and scientific domains, demonstrate that the proposed techniques can improve the performance of nested loops by up to 2.87 times compared to the conventional approach of accelerating only the innermost loops. Moreover, the mappings generated by our techniques require only a modest number of configurations, which can fit in recent reconfigurable architectures.
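
To make the overhead argument concrete, the following is a rough back-of-the-envelope cost model, not the paper's evaluation method. It assumes the standard software-pipelining estimate that a loop of N iterations with initiation interval II and S pipeline stages takes (N + S - 1) * II cycles, and adds a hypothetical fixed `setup_cycles` term for the host-to-coprocessor communication delay. The function name and all parameter values are illustrative assumptions chosen only to show how non-kernel time accumulates when the inner loop's pipeline is restarted on every outer iteration.

```python
def nested_loop_cycles(outer_trips, inner_trips, ii, stages, setup_cycles):
    """Cycle estimate when only the innermost loop is pipelined.

    Each outer iteration restarts the pipeline, so it pays the fill/drain
    cost (stages - 1) * ii and the assumed setup delay again.
    """
    kernel = inner_trips * ii            # steady-state (kernel) cycles
    fill_drain = (stages - 1) * ii       # prologue + epilogue cycles
    per_outer = setup_cycles + kernel + fill_drain
    total = outer_trips * per_outer
    non_kernel = outer_trips * (setup_cycles + fill_drain)
    return total, non_kernel


# Illustrative (made-up) parameters: short inner loop, costly setup.
total, non_kernel = nested_loop_cycles(
    outer_trips=64, inner_trips=16, ii=2, stages=5, setup_cycles=40
)
overhead_fraction = non_kernel / total
print(f"total={total} cycles, non-kernel share={overhead_fraction:.2f}")
```

With these assumed numbers, more than half of the estimated cycles are spent outside the kernel, which is the kind of situation the abstract describes for short inner loops nested in an outer loop.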