For learning and classification workloads that operate on large amounts of unstructured data under stringent performance constraints, general-purpose processor performance scales poorly with data size. In this paper, we present a programmable accelerator for this workload domain. To architect the accelerator, we profile five representative workloads and find that their computationally intensive portions can be formulated as matrix or vector operations that generate large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, max/min selection, or aggregation. The proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses in-memory processing, where on-chip memory blocks perform the secondary reduction operations. As a result, the intermediate data are processed on the fly and never stored or sent off-chip. Second, MAPLE uses banked off-chip memory and organizes its PEs into independent groups, each with its own off-chip memory bank. Together, these two features allow MAPLE to scale its performance with data size. This paper describes the MAPLE architecture, explores its design space with a simulator, and illustrates how to automatically map application kernels to the hardware. We also implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5-10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz.
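To make the compute-then-reduce pattern concrete, the following is a minimal software sketch (not MAPLE's actual hardware or API; the function and parameter names are illustrative). It computes a large set of intermediate dot-product scores but reduces them on the fly with a bounded top-k structure, so the full intermediate array is never materialized, analogous to MAPLE's in-memory reduction of intermediate data before it would otherwise be stored or sent off-chip.

```python
import heapq

def stream_rank(data, query, k):
    """Compute dot(query, row) for every row of `data`, but reduce
    the intermediate scores on the fly: only the current top-k
    (score, row_index) pairs are ever kept, never the full score array."""
    heap = []  # min-heap holding at most k (score, index) pairs
    for i, row in enumerate(data):
        score = sum(q * x for q, x in zip(query, row))  # intermediate value
        if len(heap) < k:
            heapq.heappush(heap, (score, i))
        elif score > heap[0][0]:
            # New score beats the weakest kept score; replace it.
            heapq.heapreplace(heap, (score, i))
    return sorted(heap, reverse=True)  # best-first (score, index) list

# Example: rank 3 rows against a query, keeping the top 2.
print(stream_rank([[1, 0], [0, 1], [2, 2]], [1, 1], k=2))
```

In hardware terms, the heap update plays the role MAPLE assigns to its on-chip memory blocks: the reduction happens as each intermediate value is produced, keeping O(k) state regardless of data size.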