Exploiting Narrow Accelerators with Data-Centric Subgraph Mapping

Authors:
Amir Hormati;Nathan Clark;Scott Mahlke
Affiliations:
University of Michigan - Ann Arbor;University of Michigan - Ann Arbor;University of Michigan - Ann Arbor
Venue:
Proceedings of the International Symposium on Code Generation and Optimization
Year:
2007

Abstract

The demand for high performance has driven acyclic computation accelerators into extensive use in modern embedded and desktop architectures. Accelerators that are ideal from a software perspective, however, are difficult or impossible to integrate into many modern architectures because of area and timing requirements. This reality is coupled with the observation that many application domains under-utilize accelerator hardware because of the narrow data they operate on and the nature of their computation. In this work, we take advantage of these facts to design accelerators capable of executing in modern architectures by narrowing datapath width and reducing interconnect. Novel compiler techniques are developed to generate high-quality code for the reduced-cost accelerators and to prevent performance loss to the extent possible. First, data width profiling is used to statistically determine how wide program data will be at run time. This information is then used by the subgraph mapping algorithm to optimally select subgraphs for execution on the targeted narrow accelerators. Overall, our data-centric compilation techniques achieve an average speedup of 6.5%, and up to 12%, over previous subgraph mapping algorithms for 8-bit accelerators. We also show that, with appropriate compiler support, the total number of execution cycles on reduced-interconnect accelerators increases by less than 1% relative to the fully connected accelerator.
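To make the idea of data width profiling concrete, the following is a minimal sketch and not the authors' implementation. It assumes signed two's-complement integers and a per-operation histogram of observed widths; the names `bit_width`, `WidthProfile`, `record`, and `fraction_within` are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def bit_width(value: int) -> int:
    """Smallest two's-complement width (in bits) that can hold a signed integer."""
    if value < 0:
        value = ~value  # -128 maps to 127, so it still fits in 8 bits
    return value.bit_length() + 1  # +1 for the sign bit


class WidthProfile:
    """Histogram of observed result widths per operation (illustrative only)."""

    def __init__(self):
        # op_id -> {width in bits -> number of times observed}
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, op_id: str, value: int) -> None:
        """Log one run-time value produced by the operation op_id."""
        self.counts[op_id][bit_width(value)] += 1

    def fraction_within(self, op_id: str, width: int) -> float:
        """Fraction of recorded values of op_id that fit in `width` bits."""
        hist = self.counts[op_id]
        total = sum(hist.values())
        if total == 0:
            return 0.0
        return sum(n for w, n in hist.items() if w <= width) / total


# Example: profile one hypothetical operation over a handful of values.
profile = WidthProfile()
for v in [3, 100, -128, 127, 300]:
    profile.record("add1", v)

narrow_fraction = profile.fraction_within("add1", 8)
```

In this sketch, a compiler pass could compare `narrow_fraction` against a threshold of its choosing before considering an operation for an 8-bit accelerator. How the paper's mapping algorithm actually consumes the profile is not specified in the abstract.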