Computation reuse in domain-specific optimization of signal recognition

  • Authors:
  • Melina Demertzi; Pedro C. Diniz; Mary W. Hall; Anna C. Gilbert; Yi Wang

  • Affiliations:
  • University of Southern California, Los Angeles, CA, USA; IST/UTL/INESC-ID, Porto Salvo, Portugal; University of Utah, Salt Lake City, UT, USA; University of Michigan, Ann Arbor, MI, USA; University of Michigan, Ann Arbor, MI, USA

  • Venue:
  • Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
  • Year:
  • 2009

Abstract

Domain-specific optimizations that exploit specific arithmetic and representation formats have been shown to achieve significant performance/area gains in FPGA hardware designs. In this work, we describe an approach to domain-specific optimization that goes beyond this representation level: we jointly optimize the high-level mathematical representation and the hardware implementation. We focus on a signal recognition system that distinguishes between spoken digits. We construct transform matrices from Walsh wavelet packets in conjunction with a BestBasis algorithm. The resulting transform matrices exhibit a rich algebraic structure and contain significant overlap across rows, which exposes substantial computation reuse in the dot products of the transform matrix rows with the signal vector. We have developed an algorithm that identifies this computation reuse and schedules the row computations across various computation units to significantly reduce the overall amount of computation. We have implemented a custom-built dot-product multiplication unit targeting a Virtex-II-Pro FPGA device that exploits computation reuse. A baseline dot-product multiplication unit, without reuse, exhibits a maximum clock rate of 199.3 MHz while utilizing only 2% of the device capacity. The optimized system that exploits reuse also includes a computation scheduler and attains a clock rate of 196 MHz while using 8,183 (57%) slices of the FPGA device. Compared with the baseline implementation, the FPGA hardware implementation with a single pipelined dot-product unit reduces the amount of computation for an individual matrix by as much as 6.35× and by 2× on average. Although it is larger in area than the baseline, the implementation that exploits reuse still achieves a 2× computation reduction when compared to 3 concurrently-executing simpler accumulation units with the same aggregate FPGA design area.
While the results in this paper reflect the opportunities of a specific signal processing problem, this work highlights the concept of exploiting computation reuse derived from a higher-level abstract representation at both the mathematical and hardware levels. As such, we believe this approach can also be leveraged in other signal recognition problems with well-characterized computational structures and signal dictionaries.
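
The idea of computation reuse across overlapping rows can be illustrated with a small software sketch. The code below is not the authors' algorithm or scheduler. It assumes a Sylvester-construction Hadamard matrix as a stand-in for the BestBasis-selected Walsh wavelet packet matrices, and the function names (`walsh_hadamard`, `matvec_baseline`, `matvec_with_reuse`) and the fixed `block` width are hypothetical choices made for illustration. Each row is split into fixed-width segments. The partial dot product of a segment is computed once and reused by every other row that has the same segment up to sign.

```python
def walsh_hadamard(n):
    """Build a 2^n x 2^n matrix of +1/-1 entries (Sylvester construction).

    Assumption: used here only as a stand-in for the paper's transform matrices.
    """
    H = [[1]]
    for _ in range(n):
        H = [row + row for row in H] + [row + [-c for c in row] for row in H]
    return H


def matvec_baseline(M, x):
    """Row-by-row dot products with no reuse; counts multiply-accumulate steps."""
    ops = 0
    y = []
    for row in M:
        acc = 0
        for c, v in zip(row, x):
            acc += c * v
            ops += 1
        y.append(acc)
    return y, ops


def matvec_with_reuse(M, x, block=4):
    """Dot products that share partial sums of identical row segments.

    Segments that differ only by an overall sign share one cached partial sum.
    """
    cache = {}  # (segment start, sign-normalized pattern) -> partial sum
    ops = 0
    y = []
    for row in M:
        acc = 0
        for start in range(0, len(row), block):
            seg = row[start:start + block]
            # Normalize the sign so that the first nonzero entry is positive.
            sign = 1
            for c in seg:
                if c != 0:
                    sign = 1 if c > 0 else -1
                    break
            pattern = tuple(sign * c for c in seg)
            key = (start, pattern)
            if key not in cache:
                xs = x[start:start + block]
                cache[key] = sum(c * v for c, v in zip(pattern, xs))
                ops += len(pattern)  # cost of computing a new partial sum
            acc += sign * cache[key]
            ops += 1  # cost of combining a (possibly reused) partial sum
        y.append(acc)
    return y, ops


if __name__ == "__main__":
    H = walsh_hadamard(3)
    x = [3, -1, 4, 1, -5, 9, 2, -6]
    y_base, ops_base = matvec_baseline(H, x)
    y_reuse, ops_reuse = matvec_with_reuse(H, x, block=4)
    assert y_base == y_reuse
    print("baseline ops:", ops_base, "ops with reuse:", ops_reuse)
```

Both functions return the same output vector, and the version with reuse performs fewer counted operations on this example. The operation counts are a software proxy only; they say nothing about the clock rates or slice utilization reported for the FPGA design.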