Dynamically Specialized Datapaths for energy efficient computing

Authors:
Venkatraman Govindaraju;Chen-Han Ho;Karthikeyan Sankaralingam
Affiliations:
Vertical Research Group, University of Wisconsin-Madison;Vertical Research Group, University of Wisconsin-Madison;Vertical Research Group, University of Wisconsin-Madison
Venue:
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Year:
2011

Abstract

Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general-purpose programmable processors. The key insights of this work are the following. First, applications execute in phases, and these phases can be determined by creating a path-tree of basic blocks rooted at the innermost loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows that a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and map code to DySER, and we evaluate the PARSEC, SPEC and Parboil benchmark suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide a geometric mean speedup of 2.1X (1.15X to 10X) and a geometric mean energy reduction of 40% (up to 70%), or a 60% energy reduction if no performance improvement is required.
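
The abstract does not define the path-tree construction in detail, so the following is only a minimal sketch of one plausible reading: starting at the header block of an innermost loop, follow control-flow successors and record each block as a child of its predecessor, stopping at the loop back edge. The control-flow graph `CFG`, the block names, and the helpers `build_path_tree` and `enumerate_paths` are all hypothetical and are not taken from the paper or from GCC.

```python
from typing import Dict, List

# Hypothetical control-flow graph of one innermost loop (an assumption for
# illustration): block name -> successor block names. "header" is the loop
# header, and the edge latch -> header is the loop back edge.
CFG: Dict[str, List[str]] = {
    "header": ["then", "else"],
    "then": ["latch"],
    "else": ["latch", "exit"],
    "latch": ["header"],
    "exit": [],
}


def build_path_tree(cfg: Dict[str, List[str]], root: str, max_depth: int = 8) -> dict:
    """Return a nested dict mapping each basic block to the subtree of its successors.

    Expansion stops at edges back to the root, at blocks already on the
    current path, at blocks without successors, and at max_depth.
    """

    def expand(block: str, on_path: frozenset, depth: int) -> dict:
        children: dict = {}
        if depth >= max_depth:
            return children
        for succ in cfg.get(block, []):
            if succ == root or succ in on_path:
                continue  # back edge or inner cycle: do not unroll it
            children[succ] = expand(succ, on_path | {succ}, depth + 1)
        return children

    return {root: expand(root, frozenset({root}), 0)}


def enumerate_paths(tree: dict) -> List[List[str]]:
    """List every root-to-leaf path of the tree as a sequence of block names."""
    paths: List[List[str]] = []

    def walk(node: str, subtree: dict, prefix: List[str]) -> None:
        current = prefix + [node]
        if not subtree:
            paths.append(current)
        for child, sub in subtree.items():
            walk(child, sub, current)

    for node, sub in tree.items():
        walk(node, sub, [])
    return paths


if __name__ == "__main__":
    tree = build_path_tree(CFG, "header")
    for path in enumerate_paths(tree):
        print(" -> ".join(path))
```

In this toy graph the tree has three root-to-leaf paths, and under this reading each one is a candidate straight-line region that a compiler pass could consider for mapping onto a specialized datapath. A block such as `latch` appears once per path that reaches it, which is the difference between a tree of paths and the original graph.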