CGRA express: accelerating execution using dynamic operation fusion

Authors:
Yongjun Park;Hyunchul Park;Scott Mahlke
Affiliations:
University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA
Venue:
CASES '09 Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems
Year:
2009

Citing 19
Cited 6

Iterative modulo scheduling: an algorithm for software pipelining loops

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Space-time scheduling of instruction-level parallelism on a raw machine

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
PipeRench: a co/processor for streaming multimedia acceleration

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
The Garp Architecture and C Compiler

Computer
The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs

IEEE Micro
Compilation Approach for Coarse-Grained Reconfigurable Architectures

IEEE Design & Test
The MorphoSys Parallel Reconfigurable System

Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
Mapping applications to the RaPiD configurable architecture

FCCM '97 Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines
A Scalable Implementation of a Reconfigurable WCDMA Rake Receiver

Proceedings of the conference on Design, automation and test in Europe - Volume 3
Exploiting Loop-Level Parallelism on Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling

DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors

Proceedings of the 32nd annual international symposium on Computer Architecture
Power-conscious configuration cache structure and code mapping for coarse-grained reconfigurable architecture

Proceedings of the 2006 international symposium on Low power electronics and design
Modulo graph embedding: mapping applications onto coarse-grained reconfigurable architectures

CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Exploiting Narrow Accelerators with Data-Centric Subgraph Mapping

Proceedings of the International Symposium on Code Generation and Optimization
VEAL: Virtualized Execution Accelerator for Loops

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Compiling custom instructions onto expression-grained reconfigurable architectures

CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Edge-centric modulo scheduling for coarse-grained reconfigurable architectures

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A New Array Fabric for Coarse-Grained Reconfigurable Architecture

DSD '08 Proceedings of the 2008 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools

Resource recycling: putting idle resources to work on a composable accelerator

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
An instruction-scheduling-aware data partitioning technique for coarse-grained reconfigurable architectures

Proceedings of the 2011 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Improving performance of nested loops on reconfigurable array processors

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Memory-based computing for performance and energy improvement in multicore architectures

Proceedings of the great lakes symposium on VLSI
Ingredients of adaptability: a survey of reconfigurable processors

VLSI Design
A just-in-time customizable processor

Proceedings of the International Conference on Computer-Aided Design

Quantified Score

Hi-index	0.00

Visualization

Abstract

Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing programmability with the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs have been effectively used for innermost loops that contain an abundant of instruction-level parallelism. Conversely, non-loop and outer-loop code are latency constrained and do not offer significant amounts of instruction-level parallelism. In these situations, CGRAs are ineffective as the majority of the resources remain idle. In this paper, dynamic operation fusion is introduced to enable CGRAs to effectively accelerate latency-constrained code regions. Dynamic operation fusion is enabled through the combination of a small bypass network added between function units in a conventional CGRA and a sub-cycle modulo scheduler to automatically identify opportunities for fusion. Results show that dynamic operation fusion reduced total application run-time by up to 17% on a 4x4 CGRA.