ACM Computing Surveys (CSUR)
Loop fusion for clustered VLIW architectures
Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
Design Methodology for a Tightly Coupled VLIW/Reconfigurable Matrix Architecture: A Case Study
Proceedings of the conference on Design, automation and test in Europe - Volume 2
Loop scheduling with timing and switching-activity minimization for VLIW DSP
ACM Transactions on Design Automation of Electronic Systems (TODAES)
A novel generic fast Fourier transform pruning technique and complexity analysis
IEEE Transactions on Signal Processing
An effective memory addressing scheme for FFT processors
IEEE Transactions on Signal Processing
A robust channel estimator for high-mobility STBC-OFDM systems
IEEE Transactions on Circuits and Systems Part I: Regular Papers
Hi-index | 0.00 |
The PFFT (Partial FFT) is an extended FFT where only part of input or output bins are used. By pruning the useless dataflow, the PFFT can potentially achieve a significant speedup in many important applications. Although theoretical aspects of the PFFT have been thoroughly studied in past three decades, efficient implementations were rarely reported. The most important obstacle is the highly irregular dataflow and the associated control flow. In addition, a size-N PFFT has 2N dataflow possibilities, so that delivering both flexibility and efficiency in the same implementation is very challenging. This paper presents a generic scheme to map the highly irregular dataflow of arbitrary PFFT onto ILP architectures with highly efficient SWP (SoftWare-Pipelining). Constraints and opportunities of algorithms and architecture are carefully analyzed and exploited. We introduce a multi-phase partitioning, bringing heterogeneous control structures and heterogeneous software pipelining schemes to minimize control overheads and to maximize the efficiency of SWP. The proposal has been tested with 10 representative benchmarks extracted from baseband applications. In experiments cycle-counts, instructions, NOPs, L1D/L1P access/miss/hit are thoroughly analyzed. Comparing to full FFTs with efficient SWP, our work reduces 20.5% - 87.5% cycle-counts, 11.2% - 86.5% instructions, 16.1% - 79.4% L1D cache accesses and 19.5% - 87.1% L1P cache accesses. To the best of our knowledge, this is the first reported work about the generic software-pipelined PFFT on ILP architectures.