Convolution engine: balancing efficiency & flexibility in specialized computing

Authors:
Wajahat Qadeer;Rehan Hameed;Ofer Shacham;Preethi Venkatesan;Christos Kozyrakis;Mark A. Horowitz
Affiliations:
Stanford University, California;Stanford University, California;Stanford University, California;Stanford University, California;Stanford University, California;Stanford University, California
Venue:
Proceedings of the 40th Annual International Symposium on Computer Architecture
Year:
2013

Abstract

This paper focuses on the trade-off between flexibility and efficiency in specialized computing. We observe that specialized units achieve most of their efficiency gains by tuning data storage and compute structures and their connectivity to the data-flow and data-locality patterns in the kernels. Hence, by identifying key data-flow patterns used in a domain, we can create efficient engines that can be programmed and reused across a wide range of applications. We present an example, the Convolution Engine (CE), specialized for the convolution-like data-flow that is common in computational photography, image processing, and video processing applications. CE achieves energy efficiency by capturing data reuse patterns, eliminating data transfer overheads, and enabling a large number of operations per memory access. We quantify the trade-offs in efficiency and flexibility and demonstrate that CE is within a factor of 2-3x of the energy and area efficiency of custom units optimized for a single kernel. CE improves energy and area efficiency by 8-15x over a SIMD engine for most applications.
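
To make the idea of a "convolution-like data-flow" concrete, the following is a minimal software sketch, not the CE hardware or its instruction set. It assumes a 1D input, a small shift-register-style window, and a caller-supplied pair of map and reduce operations; the function name `sliding_window_map_reduce` and the counters `loads` and `ops` are hypothetical names introduced only for illustration. Each input element is fetched once and then reused by every window that overlaps it, which is the kind of data reuse and high operations-per-memory-access ratio the abstract refers to.

```python
from collections import deque


def sliding_window_map_reduce(signal, taps, map_op, reduce_op):
    """Apply a convolution-like stencil, fetching each input element once."""
    window = deque(maxlen=len(taps))  # models a shift register of recent inputs
    outputs = []
    loads = 0  # elements fetched from "memory"
    ops = 0    # map operations performed
    for sample in signal:
        window.append(sample)  # one load per input; older values are reused
        loads += 1
        if len(window) == len(taps):
            acc = None
            for x, t in zip(window, taps):
                v = map_op(x, t)
                ops += 1
                acc = v if acc is None else reduce_op(acc, v)
            outputs.append(acc)
    return outputs, loads, ops


signal = [1, 2, 3, 4, 5, 6]
taps = [1, 0, -1]

# Multiply as the map step and add as the reduce step: a correlation-style filter
# (the taps are not flipped, so this is convolution-like rather than strict convolution).
filtered, loads, ops = sliding_window_map_reduce(
    signal, taps, lambda x, t: x * t, lambda a, b: a + b
)

# Same data-flow with a different map step (absolute difference), chosen here
# purely as an illustrative second kernel.
sad, _, _ = sliding_window_map_reduce(
    signal, taps, lambda x, t: abs(x - t), lambda a, b: a + b
)

print(filtered, loads, ops)  # [-2, -2, -2, -2] 6 12
print(sad)                   # [6, 9, 12, 15]
```

In this toy run, 6 loads feed 12 map operations, and the ratio grows with the window size. Swapping the map and reduce functions changes the kernel without changing the data movement, which loosely mirrors the flexibility-versus-efficiency argument made above.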