Building a multi-FPGA virtualized Restricted Boltzmann Machine architecture using embedded MPI

  • Authors:
  • Charles Lo; Paul Chow

  • Affiliations:
  • University of Toronto, Toronto, ON, Canada; University of Toronto, Toronto, ON, Canada

  • Venue:
  • Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays
  • Year:
  • 2011

Abstract

Several FPGA architectures exist for accelerating Restricted Boltzmann Machines (RBMs). However, for most of them the network size is limited by the amount of available on-chip memory, so many FPGAs are required to implement the very large networks used in real-world applications. A virtualized design can time-multiplex the hardware resources to handle much larger networks, but it suffers a performance penalty due to context switching. In this paper, we present a number of improvements to a virtualized FPGA architecture for RBMs. First, we take advantage of 16-bit arithmetic to pack larger networks onto a chip. Second, a custom DMA engine is designed to reduce the performance impact of the large number of memory transactions. Finally, the architecture is scaled to multiple FPGAs to gain additional performance through coarse-grained parallelism. The design effort required to implement these changes is minimized through the use of an embedded MPI framework. The architecture, tested on a Berkeley Emulation Engine 3 (BEE3) platform running at 100 MHz, achieves a speed of 12.563 GCUPS on an 8192×8192 network.
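The abstract combines three mechanisms: 16-bit arithmetic, virtualization (time-multiplexing a fixed hardware block over a larger network), and coarse-grained partitioning across FPGAs. The Python sketch below is a software model of these mechanisms, not the paper's hardware design. It applies them to the hidden-unit pass of a binary RBM, p(h_j = 1 | v) = σ(b_j + Σ_i v_i w_ij). The Q4.11 fixed-point format, the tile size, and the column-wise split of hidden units across nodes are illustrative assumptions. All function names are hypothetical. The model does not include the DMA engine or the MPI communication.

```python
import math
import random

# ASSUMPTION: signed 16-bit values with 11 fractional bits (Q4.11).
# The paper only states that 16-bit arithmetic is used.
FRAC_BITS = 11
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def to_fixed(x: float) -> int:
    """Quantize a real number to a saturated 16-bit fixed-point integer
    (round to nearest, ties to even, as Python's round() does)."""
    q = round(x * (1 << FRAC_BITS))
    return max(INT16_MIN, min(INT16_MAX, q))


def to_float(q: int) -> float:
    """Convert a fixed-point integer back to a real number."""
    return q / (1 << FRAC_BITS)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b, cols):
    """p(h_j = 1 | v) for the hidden units listed in `cols`.

    v: binary visible states; W[i][j]: fixed-point weight between visible
    unit i and hidden unit j; b[j]: fixed-point hidden bias. The Python int
    accumulator stands in for a wide hardware accumulator (no overflow).
    """
    out = []
    for j in cols:
        acc = b[j]
        for i, vi in enumerate(v):
            if vi:  # binary inputs turn each multiply into a conditional add
                acc += W[i][j]
        out.append(sigmoid(to_float(acc)))
    return out


def hidden_probs_virtualized(v, W, b, tile):
    """Time-multiplex one fixed-size compute block over all hidden units.

    Each iteration models one context: only `tile` columns of W are treated
    as resident. Moving to the next slice stands in for the memory transfer
    a context switch would require (not modelled in cost here).
    """
    n_hid = len(b)
    out = []
    for j0 in range(0, n_hid, tile):
        out.extend(hidden_probs(v, W, b, range(j0, min(j0 + tile, n_hid))))
    return out


def hidden_probs_partitioned(v, W, b, n_nodes, tile):
    """Coarse-grained split of hidden units across `n_nodes` devices.

    Each node owns a contiguous block of hidden units and runs the
    virtualized loop on it. Results are concatenated, much as a gather would
    collect them. Only the data decomposition is modelled; no messages are sent.
    """
    n_hid = len(b)
    per_node = -(-n_hid // n_nodes)  # ceiling division
    out = []
    for node in range(n_nodes):
        lo, hi = node * per_node, min((node + 1) * per_node, n_hid)
        for j0 in range(lo, hi, tile):
            out.extend(hidden_probs(v, W, b, range(j0, min(j0 + tile, hi))))
    return out


if __name__ == "__main__":
    random.seed(0)
    n_vis, n_hid = 6, 10
    W = [[to_fixed(random.uniform(-1, 1)) for _ in range(n_hid)]
         for _ in range(n_vis)]
    b = [to_fixed(random.uniform(-0.5, 0.5)) for _ in range(n_hid)]
    v = [random.randint(0, 1) for _ in range(n_vis)]
    ref = hidden_probs(v, W, b, range(n_hid))
    # Tiling and partitioning should reproduce the reference exactly.
    assert hidden_probs_virtualized(v, W, b, tile=3) == ref
    assert hidden_probs_partitioned(v, W, b, n_nodes=4, tile=2) == ref
    print([round(p, 3) for p in ref])
```

The fixed-point sums are exact integer additions in this model, so splitting the work into tiles or across nodes gives bit-identical results to the unpartitioned pass. This is a convenient property when checking a partitioned design against a reference. In real hardware, the cost that matters is the memory traffic each context switch incurs, and this sketch does not capture it.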