Pipelining algorithms typically aim to improve only steady-state performance, that is, the kernel time. Pipeline setup happens only once per loop, so its cost is usually negligible compared to the kernel time. However, when a Coarse-Grained Reconfigurable Architecture (CGRA) serves as a coprocessor to a main processor, pipeline setup can take much longer because of the communication delay between the two processors, and it can become significant when it is repeated by an outer loop of a loop nest. In this paper we evaluate the overhead of such non-kernel execution time when mapping nested loops onto CGRAs, and we propose a novel architecture-compiler cooperative scheme that reduces this overhead while also minimizing the number of extra configurations required. Our experimental results, using loops from the multimedia and scientific domains, show that the proposed techniques improve the performance of nested loops by up to 2.87 times over the conventional approach of accelerating only the innermost loops. Moreover, the mappings generated by our techniques require only a modest number of configurations, which can fit in recent reconfigurable architectures.
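To make the amortization argument concrete, the following is a minimal back-of-the-envelope sketch, not the cost model used in the paper. It assumes that each launch of the pipelined innermost loop pays a fixed setup cost and then runs its kernel at a fixed initiation interval. The function names, parameters, and all cycle counts are hypothetical and chosen only for illustration.

```python
def nest_cycles(outer_iters, inner_iters, ii, setup_cycles):
    """Total cycles when the pipelined inner loop is relaunched once per
    outer iteration (simplified, assumed model)."""
    kernel_cycles = inner_iters * ii          # steady-state time per launch
    return outer_iters * (setup_cycles + kernel_cycles)


def non_kernel_fraction(outer_iters, inner_iters, ii, setup_cycles):
    """Share of total cycles spent outside the kernel under the same model."""
    total = nest_cycles(outer_iters, inner_iters, ii, setup_cycles)
    return outer_iters * setup_cycles / total


# Hypothetical numbers: the same 1600 inner iterations, run either as one
# long loop or as 100 launches of 16 iterations from an outer loop.
single_loop = non_kernel_fraction(1, 1600, ii=2, setup_cycles=40)
loop_nest = non_kernel_fraction(100, 16, ii=2, setup_cycles=40)
print(f"single loop: {single_loop:.1%} non-kernel")
print(f"loop nest:   {loop_nest:.1%} non-kernel")
```

Under these assumed numbers the setup cost is a small fraction of the single long loop but more than half of the loop nest's total time, which is the situation the abstract describes for short inner loops with a costly setup.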