IMORC: An infrastructure and architecture template for implementing high-performance reconfigurable FPGA accelerators

  • Authors:
  • Tobias Schumacher; Christian Plessl; Marco Platzner

  • Affiliations:
  • Paderborn Center for Parallel Computing, University of Paderborn, 33098 Paderborn, Germany (all authors)

  • Venue:
  • Microprocessors & Microsystems
  • Year:
  • 2012

Abstract

The design, implementation and optimization of FPGA accelerators is a challenging task, especially when the accelerator comprises multiple compute cores distributed across CPU and FPGA resources and memories and exhibits data-dependent runtime behavior. To simplify the development of FPGA accelerators, we propose IMORC, an infrastructure and architecture template that helps raise the level of abstraction. The IMORC development flow is based on a modeling technique for visualizing an application's communication demand and on an architecture template that aids the developer in implementing the design. The architecture template consists of a versatile on-chip interconnect with asynchronous FIFOs and bitwidth conversion placed in the communication links, a performance monitoring infrastructure for collecting performance information at runtime, and a set of generic infrastructure cores that are frequently needed in accelerator designs. We demonstrate the usefulness of the IMORC development flow with a case study that accelerates the k-th nearest neighbor thinning problem, where IMORC greatly helps us in understanding the communication demand and in implementing the application. With the integrated performance monitoring infrastructure, we gain insights into the data-dependent behavior of the accelerator that help us identify bottlenecks and optimize the accelerator to achieve a speedup of 10x to 40x over an optimized CPU implementation.
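
To make the idea of a communication link with buffering, bitwidth conversion and performance counters more concrete, the following is a minimal software sketch in Python. It is not IMORC's actual interface or hardware design: the class name `LinkModel`, its parameters, the counter names and the choice to emit the least significant bits first are all assumptions made for illustration, and clock-domain crossing (the asynchronous part of the FIFOs) is not modeled at all.

```python
class LinkModel:
    """Toy model of one link: bounded FIFO + bitwidth conversion + counters.

    Hypothetical illustration only; not IMORC's API.
    """

    def __init__(self, in_bits, out_bits, depth_bits):
        self.in_bits = in_bits        # width of words written by the producer
        self.out_bits = out_bits      # width of words read by the consumer
        self.depth_bits = depth_bits  # FIFO capacity, measured in bits
        self.buffer = 0               # buffered bits; oldest bits sit at the low end
        self.fill = 0                 # number of valid bits currently buffered
        # Assumed counter names; they record accepted transfers and stalls.
        self.counters = {"writes": 0, "reads": 0,
                         "full_stalls": 0, "empty_stalls": 0}

    def write(self, word):
        """Try to enqueue one in_bits-wide word; return False if the FIFO is full."""
        if self.fill + self.in_bits > self.depth_bits:
            self.counters["full_stalls"] += 1
            return False
        word &= (1 << self.in_bits) - 1      # truncate to the input width
        self.buffer |= word << self.fill     # append above the buffered bits
        self.fill += self.in_bits
        self.counters["writes"] += 1
        return True

    def read(self):
        """Try to dequeue one out_bits-wide word; return None if too few bits."""
        if self.fill < self.out_bits:
            self.counters["empty_stalls"] += 1
            return None
        word = self.buffer & ((1 << self.out_bits) - 1)  # take the oldest bits
        self.buffer >>= self.out_bits
        self.fill -= self.out_bits
        self.counters["reads"] += 1
        return word


# Narrowing example: one 32-bit write is consumed as four 8-bit reads.
link = LinkModel(in_bits=32, out_bits=8, depth_bits=64)
link.write(0x11223344)
narrow_words = [link.read() for _ in range(4)]
```

In this sketch the stall counters play the role that runtime performance monitoring plays in the abstract: a link whose `full_stalls` count grows quickly suggests a consumer-side bottleneck, while a high `empty_stalls` count suggests the producer cannot keep up.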