For learning and classification workloads that operate on large amounts of unstructured data under stringent performance constraints, general-purpose processor performance scales poorly with data size. In this paper, we present a programmable accelerator for this workload domain. To architect the accelerator, we profile five representative workloads and find that their computationally intensive portions can be formulated as matrix or vector operations that generate large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, max/min selection, or aggregation. The proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses in-memory processing, where on-chip memory blocks perform the secondary reduction operations. As a result, the intermediate data are processed on the fly and never stored or sent off-chip. Second, MAPLE uses banked off-chip memory and organizes its PEs into independent groups, each with its own off-chip memory bank. Together, these two features allow MAPLE to scale its performance with data size. This paper describes the MAPLE architecture, explores its design space with a simulator, and illustrates how to automatically map application kernels to the hardware. We also implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5-10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz.
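To make the compute-then-reduce pattern concrete, the following is a minimal software sketch (not MAPLE's actual hardware or API; the function and parameter names are illustrative). It computes a large set of intermediate dot-product scores but reduces them on the fly with a bounded top-k structure, so the full intermediate array is never materialized, analogous to MAPLE's in-memory reduction of intermediate data before it would otherwise be stored or sent off-chip.

```python
import heapq

def stream_rank(data, query, k):
    """Compute dot(query, row) for every row of `data`, but reduce
    the intermediate scores on the fly: only the current top-k
    (score, row_index) pairs are ever kept, never the full score array."""
    heap = []  # min-heap holding at most k (score, index) pairs
    for i, row in enumerate(data):
        score = sum(q * x for q, x in zip(query, row))  # intermediate value
        if len(heap) < k:
            heapq.heappush(heap, (score, i))
        elif score > heap[0][0]:
            # New score beats the weakest kept score; replace it.
            heapq.heapreplace(heap, (score, i))
    return sorted(heap, reverse=True)  # best-first (score, index) list

# Example: rank 3 rows against a query, keeping the top 2.
print(stream_rank([[1, 0], [0, 1], [2, 2]], [1, 1], k=2))
```

In hardware terms, the heap update plays the role MAPLE assigns to its on-chip memory blocks: the reduction happens as each intermediate value is produced, keeping O(k) state regardless of data size.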