A Massively Parallel, Energy Efficient Programmable Accelerator for Learning and Classification

Authors:
Abhinandan Majumdar; Srihari Cadambi; Michela Becchi; Srimat T. Chakradhar; Hans Peter Graf
Affiliations:
NEC Laboratories America, Inc.; NEC Laboratories America, Inc.; NEC Laboratories America, Inc.; NEC Laboratories America, Inc.; NEC Laboratories America, Inc.
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2012


Abstract

Applications that use learning and classification algorithms operate on large amounts of unstructured data and have stringent performance constraints. For such applications, the performance of general-purpose processors scales poorly with data size because of their limited support for fine-grained parallelism and their lack of software-managed caches. The large intermediate data in these applications also limits achievable performance on many-core processors such as GPUs. To accelerate such learning applications, we present a programmable accelerator that can execute multiple learning and classification algorithms. To architect such an accelerator, we profile five representative workloads and find that their computationally intensive portions can be formulated as matrix or vector operations that generate large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding the maximum or minimum, or aggregation. Our proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses dynamic in-memory processing, where on-chip memory blocks perform the secondary reduction operations. Second, MAPLE uses banked off-chip memory and organizes its PEs into independent groups, each with its own off-chip memory bank. These two features allow MAPLE to scale its performance with data size. We also present an Atom-based, energy-efficient heterogeneous system with MAPLE as the accelerator that satisfies the application’s performance requirements at a lower system power. This article describes the MAPLE architecture, explores its design space with a simulator, illustrates how to automatically map application kernels to the hardware, and presents its performance improvement and energy benefits over classic server-based implementations.
We implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5x to 10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz clock rate. With MAPLE connected to a 1.6 GHz dual-core Atom, we show an energy improvement of 38% to 84% over the Xeon server coupled to a 1.3 GHz 240-core Tesla GPU.
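
The computation pattern the abstract describes, a matrix or vector operation whose large intermediate result is immediately reduced by a secondary operation such as ranking, can be illustrated in software. The following is a minimal sketch, not MAPLE's implementation: it assumes the primary operation is a matrix-vector product and the secondary operation is a top-k ranking. All function names (`topk_materialized`, `topk_streaming`, `topk_grouped`) and the round-robin row partitioning are hypothetical choices made for illustration. The first function stores every intermediate score before ranking, the second reduces each score as it is produced so that only k values are ever held, and the third splits the rows into independent groups, loosely analogous to PE groups with their own memory banks, and merges the per-group results.

```python
import heapq


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def topk_materialized(matrix, query, k):
    """Baseline: compute all scores, store them, then rank the full array."""
    scores = [dot(row, query) for row in matrix]  # large intermediate array
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return [(scores[i], i) for i in ranked[:k]]


def topk_streaming(indexed_rows, query, k):
    """Fused variant: reduce each score as it is produced, keeping only k.

    indexed_rows yields (row_index, row) pairs. Ties are broken in favour
    of the lower row index, matching topk_materialized.
    """
    heap = []  # min-heap of (score, -index); heap[0] is the weakest kept item
    for idx, row in indexed_rows:
        item = (dot(row, query), -idx)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return [(score, -neg_idx) for score, neg_idx in sorted(heap, reverse=True)]


def topk_grouped(matrix, query, k, num_groups):
    """Partition rows into independent groups, reduce locally, then merge.

    The round-robin assignment of rows to groups is an assumption of this
    sketch, not something stated in the abstract.
    """
    partial = []
    for g in range(num_groups):
        rows = ((i, matrix[i]) for i in range(g, len(matrix), num_groups))
        partial.extend(topk_streaming(rows, query, k))
    best = heapq.nlargest(k, ((score, -idx) for score, idx in partial))
    return [(score, -neg_idx) for score, neg_idx in best]


if __name__ == "__main__":
    matrix = [[1, 0], [0, 1], [2, 2], [3, 1], [1, 3]]
    query = [1, 2]
    print(topk_materialized(matrix, query, 2))
    print(topk_streaming(enumerate(matrix), query, 2))
    print(topk_grouped(matrix, query, 2, num_groups=2))
```

All three functions return the same ranked list of (score, row index) pairs. The streaming and grouped variants differ only in how much intermediate state they hold and in how the work is divided, which is the property the abstract attributes to in-memory reduction and banked memory.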