High-performance reconfigurable hardware architecture for restricted Boltzmann machines
IEEE Transactions on Neural Networks
Several FPGA architectures exist for accelerating Restricted Boltzmann Machines (RBMs). However, the network size of most is limited by the amount of available on-chip memory, so many FPGAs are required to implement the very large networks used in real-world applications. A virtualized design can time-multiplex the hardware resources to handle much larger networks, but it suffers a performance penalty due to the context switches. In this paper, we present several improvements to a virtualized FPGA architecture for RBMs. First, we take advantage of 16-bit arithmetic to pack larger networks onto a chip. Second, a custom DMA engine is designed to reduce the performance impact of the large number of memory transactions. Finally, the architecture is scaled to multiple FPGAs to gain additional performance through coarse-grained parallelism. The design effort required to implement these changes is minimized through the use of an embedded MPI framework. The architecture, tested on a Berkeley Emulation Engine 3 platform running at 100 MHz, achieves 12.563 GCUPS on an 8192×8192 network.