Characterizing the behavior of sparse algorithms on caches
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Digital signal processing (3rd ed.): principles, algorithms, and applications
Digital signal processing (3rd ed.): principles, algorithms, and applications
A fast Fourier transform compiler
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Templates for the solution of algebraic eigenvalue problems: a practical guide
Templates for the solution of algebraic eigenvalue problems: a practical guide
Digital filter synthesis based on minimal signed digit representation
Proceedings of the 38th annual Design Automation Conference
Synthesis of saturation arithmetic architectures
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Implementing a Simple Continuous Speech Recognition System on an FPGA
FCCM '02 Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
A low-power accelerator for the SPHINX 3 speech recognition system
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
FPGAs vs. CPUs: trends in peak floating-point performance
FPGA '04 Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays
Sparse Matrix-Vector multiplication on FPGAs
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Floating-point sparse matrix-vector multiply for FPGAs
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
64-bit floating-point FPGA matrix multiplication
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
One-Step Compilation of Image Processing Applications to FPGAs
FCCM '01 Proceedings of the the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
An FPGA-Based Coprocessor for the SPHINX Speech Recognition System: Early Experiences
RECONFIG '05 Proceedings of the 2005 International Conference on Reconfigurable Computing and FPGAs (ReConFig'05) on Reconfigurable Computing and FPGAs
Embedded floating-point units in FPGAs
Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays
Compilers: Principles, Techniques, and Tools (2nd Edition)
Compilers: Principles, Techniques, and Tools (2nd Edition)
Generating FPGA-Accelerated DFT Libraries
FCCM '07 Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Formal datapath representation and manipulation for implementing DSP transforms
Proceedings of the 45th annual Design Automation Conference
Synthesis and Optimization of 2D Filter Designs for Heterogeneous FPGAs
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
FPGA-based Implementation of Signal Processing Systems
FPGA-based Implementation of Signal Processing Systems
Fast sparse matrix-vector multiplication by exploiting variable block structure
HPCC'05 Proceedings of the First international conference on High Performance Computing and Communications
Hi-index | 0.00 |
Domain-specific optimizations on matrix computations exploiting specific arithmetic and matrix representation formats have achieved significant performance/area gains in Field-Programmable Gate Array (FPGA) hardware designs. In this article, we explore the application of data-driven optimizations to reduce both storage and computation requirements to the problem of signal recognition from a known dictionary. By starting with a high-level mathematical representation of a signal recognition problem, we perform optimizations across the layers of the system, exploiting mathematical structure to improve implementation efficiency. Specifically, we use Walsh wavelet packets in conjunction with a BestBasis algorithm to distinguish between spoken digits. The resulting transform matrices are quite sparse, and exhibit a rich algebraic structure that contains significant overlap across rows. As a consequence, dot-product computations of the transform matrix and signal vectors exhibit significant computation reuse, or repeated identical computations. We present an algorithm for identifying this computation reuse and scheduling of the row computations. We exploit this reuse to derive FPGA hardware implementations that reduce the amount of computation for an individual matrix by as much as 6.35× and an average of 2× for a single dot-product unit. The implementation that exploits reuse achieves a 2× computation reduction compared to three concurrently-executing simpler accumulator units with the same aggregate design area and outperforms software implementations on high-end desktop personal computers.