Building a multi-FPGA virtualized restricted boltzmann machine architecture using embedded MPI

Authors:
Charles Lo;Paul Chow
Affiliations:
University of Toronto, Toronto, ON, Canada;University of Toronto, Toronto, ON, Canada
Venue:
Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays
Year:
2011

Citing 10
Cited 0

Maximally equidistributed combined Tausworthe generators

Mathematics of Computation
Recognizing Handwritten Digits Using Hierarchical Products of Experts

IEEE Transactions on Pattern Analysis and Machine Intelligence
Testing parallel random number generators

Parallel Computing
BEE2: A High-End Reconfigurable Computing System

IEEE Design & Test
A fast learning algorithm for deep belief nets

Neural Computation
A high-performance FPGA architecture for restricted boltzmann machines

Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Large-scale deep unsupervised learning using graphics processors

ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
Semantic hashing

International Journal of Approximate Reasoning
A Large-Scale Architecture for Restricted Boltzmann Machines

FCCM '10 Proceedings of the 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines
High-performance reconfigurable hardware architecture for restricted Boltzmann machines

IEEE Transactions on Neural Networks

Quantified Score

Hi-index	0.00

Visualization

Abstract

Several FPGA architectures exist for accelerating Restricted Boltzmann Machines (RBMs). However, the network size for most is limited by the amount of available on-chip memory. Therefore, many FPGAs are required to implement very large networks for use in real-world applications. A virtualized design is able to time-multiplex the hardware resources and handle much larger networks but suffers a performance penalty due to the context switch. In this paper, we present a number of improvements to a virtualized FPGA architecture for RBMs. First, we take advantage of 16-bit arithmetic to pack larger networks onto a chip. Second, a custom DMA engine is designed to reduce the performance impact of the large amount of memory transactions. Finally, the architecture is scaled to multiple FPGAs to gain additional performance through coarse grain parallelism. The design effort required to implement these changes is minimized through the use of an embedded MPI framework. The architecture, tested on a Berkeley Emulation Engine 3 platform running at 100 Mhz, achieves a speed of 12.563 GCUPS on a 8192x8192 network.