Domain-Specific Optimization of Signal Recognition Targeting FPGAs

Authors:
Melina Demertzi; Pedro C. Diniz; Mary W. Hall; Anna C. Gilbert; Yi Wang
Affiliations:
University of Southern California; INESC-ID, Lisboa; University of Utah; University of Michigan; University of Michigan
Venue:
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Year:
2011

Abstract

Domain-specific optimizations of matrix computations that exploit specific arithmetic and matrix representation formats have achieved significant performance and area gains in Field-Programmable Gate Array (FPGA) hardware designs. In this article, we explore data-driven optimizations that reduce both the storage and the computation requirements of signal recognition from a known dictionary. Starting from a high-level mathematical representation of a signal recognition problem, we optimize across the layers of the system, exploiting mathematical structure to improve implementation efficiency. Specifically, we use Walsh wavelet packets in conjunction with a BestBasis algorithm to distinguish between spoken digits. The resulting transform matrices are quite sparse and exhibit a rich algebraic structure with significant overlap across rows. As a consequence, the dot products of the transform-matrix rows with the signal vectors exhibit significant computation reuse, that is, repeated identical computations. We present an algorithm that identifies this computation reuse and schedules the row computations. We exploit this reuse to derive FPGA hardware implementations that reduce the amount of computation for an individual matrix by as much as 6.35× and by an average of 2× for a single dot-product unit. The implementation that exploits reuse achieves a 2× computation reduction compared to three concurrently executing simpler accumulator units with the same aggregate design area, and it outperforms software implementations on high-end desktop personal computers.
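To make the idea of computation reuse across overlapping rows concrete, the following is a minimal sketch in Python. It is not the article's algorithm, which the abstract does not specify. As a simplifying assumption, it reuses only partial sums that two rows share as a common prefix of their nonzero entries, and it counts one accumulate operation per nonzero term. The names `row_terms`, `naive_ops`, `transform_with_prefix_reuse`, `toy_matrix`, and `toy_signal` are hypothetical, and the small matrix with entries in {-1, 0, +1} is made up for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Term = Tuple[int, int]  # (column index, nonzero coefficient)


def row_terms(row: Sequence[int]) -> Tuple[Term, ...]:
    """Return the nonzero entries of a row as (column, coefficient) pairs."""
    return tuple((j, v) for j, v in enumerate(row) if v != 0)


def naive_ops(matrix: Sequence[Sequence[int]]) -> int:
    """Accumulate operations when every row is computed independently:
    one accumulate per nonzero entry (accumulator starts at zero)."""
    return sum(len(row_terms(row)) for row in matrix)


def transform_with_prefix_reuse(
    matrix: Sequence[Sequence[int]], signal: Sequence[int]
) -> Tuple[List[int], int]:
    """Compute every row-by-signal dot product, reusing cached partial sums.

    Assumption (not from the article): a partial sum is reused only when it
    covers a common prefix of the nonzero terms of two rows.
    Returns the dot products and the number of accumulate operations used.
    """
    cache: Dict[Tuple[Term, ...], int] = {(): 0}  # empty prefix sums to zero
    ops = 0
    results: List[int] = []
    for row in matrix:
        terms = row_terms(row)
        # Find the longest prefix of this row whose partial sum is cached.
        k = len(terms)
        while terms[:k] not in cache:
            k -= 1
        acc = cache[terms[:k]]
        # Accumulate only the remaining terms, caching each new partial sum.
        for i in range(k, len(terms)):
            col, coeff = terms[i]
            acc += coeff * signal[col]
            ops += 1
            cache[terms[: i + 1]] = acc
        results.append(acc)
    return results, ops


# Made-up toy data: four sparse rows with overlapping leading entries.
toy_matrix = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, -1, -1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, -1, -1, 0, 0, 1, -1],
]
toy_signal = [1, 2, 3, 4, 5, 6, 7, 8]

if __name__ == "__main__":
    outputs, reuse_ops = transform_with_prefix_reuse(toy_matrix, toy_signal)
    print("dot products:", outputs)
    print("accumulates without reuse:", naive_ops(toy_matrix))
    print("accumulates with prefix reuse:", reuse_ops)
```

In this sketch the order in which rows are processed determines which partial sums are available for reuse, which is why identifying reuse and scheduling the row computations go together. A hardware design would also have to pay for storing the cached partial sums, a cost this operation count ignores.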