As we enter the multi-core era, boosting the performance of single-threaded applications remains critical. Raising the operating frequency to improve processor performance is meeting growing obstacles. Significant gains can still be achieved, however, by extending the processor with hardware support that makes more effective use of the available transistors. This paper presents a novel hardware support mechanism, called DistTree, that speeds up processor performance. The DistTree hardware automates gather and scatter operations for applications with complex but predictable memory access patterns, such as the Fast Fourier Transform (FFT). With this hardware support integrated into a modern microprocessor (the Alpha architecture in our experiments), FFT performance more than doubles compared with the FFTW library, a state-of-the-art implementation. By reducing both arithmetic and address computation overhead, the DistTree hardware support lets the processor spend the majority of its cycles executing the computations of an algorithm. As a result, the performance of many single-threaded applications can be significantly increased.
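To make the gather and scatter access pattern concrete, the following is a minimal software sketch of a radix-2 decimation-in-time FFT in which every strided read and write goes through explicit helpers. It is an illustration only and does not model the DistTree hardware or FFTW; the names `gather`, `scatter`, and `fft_gather_scatter` are hypothetical and chosen for this example. The index arithmetic inside the helpers is the kind of address computation overhead the abstract refers to.

```python
import cmath


def gather(data, base, stride, count):
    """Collect `count` elements starting at `base`, spaced `stride` apart."""
    return [data[base + i * stride] for i in range(count)]


def scatter(data, base, stride, values):
    """Write `values` to positions base, base + stride, base + 2*stride, ..."""
    for i, v in enumerate(values):
        data[base + i * stride] = v


def fft_gather_scatter(x):
    """Recursive radix-2 FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)

    # Gather the even- and odd-indexed samples (stride-2 reads).
    even = fft_gather_scatter(gather(x, 0, 2, n // 2))
    odd = fft_gather_scatter(gather(x, 1, 2, n // 2))

    # Butterfly stage: combine the two half-size transforms.
    top, bottom = [], []
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        top.append(even[k] + t)
        bottom.append(even[k] - t)

    # Scatter the results into the two halves of the output.
    out = [0j] * n
    scatter(out, 0, 1, top)
    scatter(out, n // 2, 1, bottom)
    return out
```

In this sketch the access pattern depends only on the transform length, never on the data values, which is what makes it predictable in the sense used above.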