ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Despite the popularity and success of neural networks in research, the number of resulting commercial or industrial applications has been limited. A primary cause of this lack of adoption is that neural networks are usually implemented as software running on general-purpose processors. Hence, a hardware implementation that can exploit the inherent parallelism of neural networks is desirable. This paper investigates how the restricted Boltzmann machine (RBM), a popular type of neural network, can be mapped to a high-performance hardware architecture on field-programmable gate array (FPGA) platforms. The proposed modular framework is designed to reduce the time complexity of the computations through heavily customized hardware engines. A method to partition large RBMs into smaller congruent components is also presented, allowing a single RBM to be distributed across multiple FPGA resources. The framework is tested on a platform of four Xilinx Virtex-II Pro XC2VP70 FPGAs running at 100 MHz in a variety of configurations. The maximum performance was obtained by instantiating an RBM of 256 × 256 nodes distributed across four FPGAs, which achieved a computational speed of 3.13 billion connection-updates-per-second and a 145-fold speedup over an optimized C program running on a 2.8-GHz Intel processor.
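The partitioning idea and the throughput figure above can be illustrated with a minimal software sketch. This is not the paper's hardware implementation: the `partition_rbm` helper, the 2 × 2 grid, and the NumPy representation are assumptions chosen for illustration, and only the 256 × 256 size and the 3.13 billion CUPS figure come from the abstract.

```python
import numpy as np

def partition_rbm(W, grid=(2, 2)):
    """Split an RBM weight matrix into congruent rectangular blocks,
    one block per FPGA (a hypothetical software analogue of the
    partitioning scheme described in the abstract)."""
    rows, cols = grid
    return [blk
            for band in np.vsplit(W, rows)   # split along visible nodes
            for blk in np.hsplit(band, cols)]  # split along hidden nodes

# A 256 x 256 RBM split across four FPGAs yields four 128 x 128 blocks.
W = np.random.randn(256, 256)
blocks = partition_rbm(W)

# Throughput metric used in the abstract: connection-updates-per-second.
connections = 256 * 256        # 65,536 weights in the full RBM
cups = 3.13e9                  # reported peak performance
full_updates_per_second = cups / connections  # whole-network update rate
```

Under this reading, one full-network weight update touches all 65,536 connections, so 3.13 billion CUPS corresponds to roughly 4.8 × 10⁴ full updates per second.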