Applications that use learning and classification algorithms operate on large amounts of unstructured data and have stringent performance constraints. For such applications, the performance of general-purpose processors scales poorly with data size because of their limited support for fine-grained parallelism and the absence of software-managed caches. The large intermediate data in these applications also limits achievable performance on many-core processors such as GPUs. To accelerate such learning applications, we present a programmable accelerator that can execute multiple learning and classification algorithms. To architect such an accelerator, we profile five representative workloads and find that their computationally intensive portions can be formulated as matrix or vector operations that generate large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding the max/min, or aggregation. Our proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses dynamic in-memory processing, where on-chip memory blocks perform the secondary reduction operations. Second, MAPLE uses banked off-chip memory and organizes its PEs into independent groups, each with its own off-chip memory bank. These two features allow MAPLE to scale its performance with data size. We also present an Atom-based, energy-efficient heterogeneous system with MAPLE as the accelerator that satisfies the application’s performance requirements at a lower system power. This article describes the MAPLE architecture, explores its design space with a simulator, illustrates how to automatically map application kernels to the hardware, and presents its performance improvement and energy benefits over classic server-based implementations.
We implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5-10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz clock rate. With MAPLE connected to a 1.6 GHz dual-core Atom, we show an energy improvement of 38-84% over the Xeon server coupled to a 1.3 GHz 240-core Tesla GPU.
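The workload pattern described above, a matrix or vector operation whose large intermediate result is immediately reduced by a secondary operation, can be illustrated with a small software sketch. This is only an analogy under stated assumptions, not MAPLE's hardware design or its programming interface: the function names (`topk_streaming`, `topk_banked`), the choice of top-k ranking as the reduction, and the contiguous split of rows into banks are all hypothetical. The sketch scores each row against a query and ranks the scores as they are produced, so the full score vector is never stored, and it processes each "bank" independently before merging the partial results.

```python
import heapq


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def topk_streaming(query, rows, k, row_offset=0):
    """Score every row against the query, keeping only the k best.

    The reduction (ranking) happens as each score is produced, so the
    intermediate data held at any time is at most k (score, index) pairs.
    """
    heap = []  # min-heap: heap[0] is the weakest of the current k best
    for i, row in enumerate(rows):
        item = (dot(query, row), row_offset + i)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


def topk_banked(query, rows, k, num_banks):
    """Hypothetical banked variant: each bank holds a contiguous slice of
    rows and is reduced independently; partial results are then merged."""
    bank_size = -(-len(rows) // num_banks)  # ceiling division
    partial = []
    for b in range(num_banks):
        start = b * bank_size
        bank_rows = rows[start:start + bank_size]
        partial.extend(topk_streaming(query, bank_rows, k, start))
    return heapq.nlargest(k, partial)


if __name__ == "__main__":
    rows = [[1, 0], [0, 1], [2, 2], [3, 1], [0, 0]]
    query = [1, 1]
    print(topk_streaming(query, rows, k=2))
    print(topk_banked(query, rows, k=2, num_banks=2))
```

Because each bank is reduced without reference to the others, the per-bank loops in this sketch could run concurrently; that independence is the property the banked organization is meant to convey, though the sketch itself runs them sequentially.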