High-performance reconfigurable hardware architecture for restricted Boltzmann machines
IEEE Transactions on Neural Networks
Despite the popularity and success of neural networks in research, the number of resulting commercial or industrial applications has been limited. A primary cause of this lack of adoption is that neural networks are usually implemented as software running on general-purpose processors, where the core algorithms are O(n²) in the number of nodes; as a result, neural networks cannot provide the performance and scalability required in non-academic settings. In this paper, we investigate how FPGAs can be used to take advantage of the inherent parallelism in neural networks to provide an implementation that is better in terms of scalability and performance. We focus on the Restricted Boltzmann machine, a popular type of neural network, because its architecture is particularly well suited to hardware designs. The proposed multi-purpose hardware framework reduces the O(n²) problem to an O(n) implementation while requiring only O(n) resources. The framework is tested on a Xilinx Virtex II-Pro XC2VP70 FPGA running at 100 MHz. The resources support a Restricted Boltzmann machine of 128x128 nodes, which results in a computational speed of 1.02 billion connection-updates-per-second and a speed-up of 35 fold over an optimized C program running on a 2.8 GHz Intel processor.
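The O(n²) cost referred to in the abstract comes from the fact that a layer update in a Restricted Boltzmann machine touches every visible-to-hidden connection once. A minimal Python sketch (function and variable names are illustrative, not from the paper) makes this explicit; it is this double loop over connections that the proposed hardware framework parallelizes down to O(n) time using O(n) resources.

```python
import math

def sample_hidden_probs(v, W, b):
    """One RBM half-step: hidden-unit activation probabilities.

    v : list of n_visible unit states (0/1)
    W : n_visible x n_hidden weight matrix (list of lists)
    b : list of n_hidden hidden biases

    The nested loop performs n_visible * n_hidden connection-updates,
    which is the O(n^2) work a software implementation does serially.
    """
    n_hidden = len(b)
    probs = []
    for j in range(n_hidden):
        total = b[j]
        for i, v_i in enumerate(v):
            total += v_i * W[i][j]  # one connection-update
        probs.append(1.0 / (1.0 + math.exp(-total)))  # logistic sigmoid
    return probs
```

For a 128x128 machine, one such half-step is 16,384 connection-updates; the FPGA design performs the inner accumulation for all hidden units in parallel, which is how the reported 1.02 billion connection-updates-per-second figure is reached.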