Neural Acceleration for General-Purpose Approximate Programs

Authors:
Hadi Esmaeilzadeh;Adrian Sampson;Luis Ceze;Doug Burger
Venue:
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2012

Abstract

This paper describes a learning-based approach to the acceleration of approximate programs. We describe the Parrot transformation, a program transformation that selects and trains a neural network to mimic a region of imperative code. After the learning phase, the compiler replaces the original code with an invocation of a low-power accelerator called a neural processing unit (NPU). The NPU is tightly coupled to the processor pipeline to accelerate small code regions. Since neural networks produce inherently approximate results, we define a programming model that allows programmers to identify approximable code regions, that is, code that can produce imprecise but acceptable results. Offloading approximable code regions to NPUs is faster and more energy efficient than executing the original code. For a set of diverse applications, NPU acceleration provides whole-application speedup of 2.3x and energy savings of 3.0x on average with quality loss of at most 9.6%.
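The core idea of the abstract, training a neural network on the observed inputs and outputs of a code region and then calling the network in place of that region, can be illustrated with a small software-only sketch. This is not the authors' compiler or NPU hardware. The names `original_region`, `train_mimic`, and `approx_region` are hypothetical, and the network topology (one hidden layer of four sigmoid units), the learning rate, and the target function are assumptions chosen only to keep the example short.

```python
import math
import random


def original_region(x):
    """Hypothetical approximable code region (stand-in for real imperative code)."""
    return x * x


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_mimic(region, inputs, hidden=4, epochs=3000, lr=0.2, seed=0):
    """Train a one-hidden-layer network on (input, region(input)) pairs.

    Returns a function that approximates `region`. Topology and
    hyperparameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(hidden)]  # input -> hidden weights
    b = [rng.uniform(-1, 1) for _ in range(hidden)]  # hidden biases
    v = [rng.uniform(-1, 1) for _ in range(hidden)]  # hidden -> output weights
    c = 0.0                                          # output bias
    samples = [(x, region(x)) for x in inputs]       # observed input/output pairs

    for _ in range(epochs):
        rng.shuffle(samples)
        for x, target in samples:
            # Forward pass: sigmoid hidden layer, linear output.
            h = [sigmoid(w[j] * x + b[j]) for j in range(hidden)]
            y = sum(v[j] * h[j] for j in range(hidden)) + c
            err = y - target  # gradient of 0.5 * (y - target)^2 w.r.t. y
            # Backward pass: plain stochastic gradient descent.
            for j in range(hidden):
                grad_hidden = err * v[j] * h[j] * (1.0 - h[j])
                v[j] -= lr * err * h[j]
                w[j] -= lr * grad_hidden * x
                b[j] -= lr * grad_hidden
            c -= lr * err

    def mimic(x):
        h = [sigmoid(w[j] * x + b[j]) for j in range(hidden)]
        return sum(v[j] * h[j] for j in range(hidden)) + c

    return mimic


# "Learning phase": observe the region on sample inputs and train the network.
train_inputs = [i / 31 for i in range(32)]
approx_region = train_mimic(original_region, train_inputs)

# "Replacement": call the trained network where the original region was called,
# and measure the quality loss on inputs not seen during training.
test_inputs = [(i + 0.5) / 20 for i in range(20)]
mean_abs_error = sum(
    abs(approx_region(x) - original_region(x)) for x in test_inputs
) / len(test_inputs)
print(f"mean absolute error: {mean_abs_error:.4f}")
```

In this sketch the approximation runs in ordinary software, so it shows only the accuracy side of the trade-off. The speedup and energy savings reported in the abstract come from executing the trained network on the dedicated NPU rather than on the processor.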