A Large-Scale Architecture for Restricted Boltzmann Machines

Authors:
Sang Kyun Kim;Peter Leonard McMahon;Kunle Olukotun
Affiliations:
-;-;-
Venue:
FCCM '10 Proceedings of the 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines
Year:
2010

Citing 0
Cited 4

Building a multi-FPGA virtualized restricted boltzmann machine architecture using embedded MPI

Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays
GPU-accelerated restricted boltzmann machine for collaborative filtering

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Towards adaptive learning with improved convergence of deep belief networks on graphics processing units

Pattern Recognition
A Fully Pipelined FPGA Architecture of a Factored Restricted Boltzmann Machine Artificial Neural Network

ACM Transactions on Reconfigurable Technology and Systems (TRETS)

Quantified Score

Hi-index	0.01

Visualization

Abstract

Deep Belief Nets (DBNs) are an emerging application in the machine learning domain, which use Restricted Boltzmann Machines (RBMs) as their basic building block. Although small scale DBNs have shown great potential, the computational cost of RBM training has been a major challenge in scaling to large networks. In this paper we present a highly scalable architecture for Deep Belief Net processing on hardware systems that can handle hundreds of boards, if not more, of customized logic with near linear performance increase. We elucidate tradeoffs between flexibility in the neuron connections, and the hardware resources, such as memory and communication bandwidth, required to build a custom processor design that has optimal efficiency. We illustrate how our architecture can easily support sparse networks with dense regions of connections between neighboring sets of neurons, which is relevant to applications where there are obvious spatial correlations in the data, such as in image processing. We demonstrate the feasibility of our approach by implementing a multi-FPGA system. We show that a speedup of 46X-112X over an optimized single core CPU implementation can be achieved for a four-FPGA implementation.