Computational frameworks for the fast Fourier transform
Computational frameworks for the fast Fourier transform
Formal loop merging for signal transforms
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
FFT program generation for shared memory: SMP and multicore
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Computer Generation of General Size Linear Transform Libraries
Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
A rewriting system for the vectorization of signal transforms
VECPAR'06 Proceedings of the 7th international conference on High performance computing for computational science
FFTC: fastest Fourier transform for the IBM cell broadband engine
HiPC'07 Proceedings of the 14th international conference on High performance computing
Multi-FFT Vectorization for the Cell Multicore Processor
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Partitioning streaming parallelism for multi-cores: a machine learning based approach
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Auto-tuning of fast fourier transform on graphics processors
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
An FFT performance model for optimizing general-purpose processor architecture
Journal of Computer Science and Technology - Special issue on Community Analysis and Information Recommendation
A transpose-free in-place SIMD optimized FFT
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
The Cell BE is a multicore processor with eight vector accelerators (called SPEs) that implement explicit cache management through direct memory access engines. While the Cell has an impressive floating point peak performance, programming and optimizing for it is difficult as it requires explicit memory management, multithreading, streaming, and vectorization. We address this problem for the discrete Fourier transform (DFT) by extending Spiral, a program generation system, to automatically generate highly optimized implementations for the Cell. The extensions include multi-SPE parallelization and explicit memory streaming, both performed at a high abstraction level using rewriting systems operating on Spiral's internal domain-specific language. Further, we support latency and throughput optimizations, single and double precision, and different data formats. The performance of Spiral's computer generated code is comparable with and sometimes better than existing DFT implementations, where available.