A dynamically configurable coprocessor for convolutional neural networks

Authors:
Srimat Chakradhar; Murugan Sankaradas; Venkata Jakkula; Srihari Cadambi
Affiliations:
NEC Laboratories America, Inc., Princeton, NJ, USA;NEC Laboratories America, Inc., Princeton, NJ, USA;NEC Laboratories America, Inc., Princeton, NJ, USA;NEC Laboratories America, Inc., Princeton, NJ, USA
Venue:
Proceedings of the 37th Annual International Symposium on Computer Architecture
Year:
2010

Abstract

Applications of convolutional neural networks (CNNs) range from recognition and reasoning (such as handwriting recognition, facial expression recognition, and video surveillance) to intelligent text applications such as semantic text analysis and natural language processing. Two key observations drive the design of a new architecture for CNNs. First, CNN workloads exhibit a widely varying mix of three types of parallelism: parallelism within a convolution operation; intra-output parallelism, where multiple input sources (features) are combined to create a single output; and inter-output parallelism, where multiple independent outputs (features) are computed simultaneously. This mix differs significantly across CNN applications and across the layers of a single CNN. Second, the number of processing elements in an architecture continues to scale (as per Moore's law) much faster than the off-chip memory bandwidth (or pin count) of chips. Based on these two observations, we show that, for a given number of processing elements and a given off-chip memory bandwidth, a new CNN hardware architecture that dynamically configures the hardware on the fly to match the specific mix of parallelism in a given workload gives the best throughput. Our CNN compiler automatically translates a high-level network specification into a parallel microprogram (a sequence of low-level VLIW instructions) that is mapped, scheduled, and executed by the coprocessor. Compared to a 2.3 GHz quad-core, dual-socket Intel Xeon, a 1.35 GHz C870 GPU, and a 200 MHz FPGA implementation, our 120 MHz dynamically configurable architecture is 4x to 8x faster. This is the first CNN architecture to achieve real-time video stream processing (25 to 30 frames per second) on a wide range of object detection and recognition tasks.
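
The three types of parallelism named in the abstract can be located in the loop nest of a single CNN layer. The following is a minimal sequential sketch in plain Python, not the coprocessor's microprogram or the authors' code: the function names `conv2d_valid` and `cnn_layer` and the `kernels[o][i]` layout are assumptions made for illustration, and bias terms and the nonlinearity are omitted.

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over one input feature map (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            # (1) Parallelism within a convolution: the kh*kw products
            # below are independent and feed a single sum.
            out[r][c] = sum(image[r + i][c + j] * kernel[i][j]
                            for i in range(kh) for j in range(kw))
    return out


def cnn_layer(inputs, kernels):
    """Compute all output maps; kernels[o][i] links input i to output o
    (hypothetical layout chosen for this sketch)."""
    outputs = []
    # (3) Inter-output parallelism: each iteration produces an
    # independent output feature map.
    for kernels_for_output in kernels:
        acc = None
        # (2) Intra-output parallelism: several input maps are convolved
        # and their results combined into one output map.
        for image, kernel in zip(inputs, kernels_for_output):
            partial = conv2d_valid(image, kernel)
            if acc is None:
                acc = partial
            else:
                acc = [[a + b for a, b in zip(row_a, row_b)]
                       for row_a, row_b in zip(acc, partial)]
        outputs.append(acc)
    return outputs
```

Each numbered comment marks a loop whose iterations could run concurrently in hardware. How many processing elements to devote to each loop is the per-layer choice that, according to the abstract, the architecture makes dynamically.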