The Fast Fourier Transform (FFT) is a fundamental kernel in many computationally intensive scientific applications. In this paper we investigate its performance on the Sony-Toshiba-IBM Cell Broadband Engine, a heterogeneous multicore chip architected for intensive gaming applications and high performance computing. The Cell processor consists of a traditional microprocessor, the PPE, that controls eight SIMD co-processing units called synergistic processor elements (SPEs). We exploit the architectural features of the Cell processor to design an efficient parallel implementation of the FFT. Although there have been several attempts to develop a fast FFT on the Cell, none has achieved high performance for input series of several thousand complex points. We use an iterative out-of-place approach to design a parallel FFT for 1K to 16K complex input samples and attain a single-precision performance of 18.6 GFLOP/s on the Cell. Our implementation beats FFTW on the Cell by several GFLOP/s for these input sizes and outperforms Intel Duo Core (Woodcrest) for inputs of more than 2K samples. To our knowledge, this is the fastest FFT for this range of complex inputs.
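To make the phrase "iterative out-of-place" concrete, the following is a minimal sequential sketch of one common formulation, the radix-2 Stockham autosort FFT. The abstract does not say which variant the authors use, so the choice of Stockham is an assumption, as are the names `fft_stockham` and `naive_dft`. The sketch shows only the loop structure and the two ping-pong buffers; it does not model the SPE parallelization, the SIMD code, or the DMA transfers of the actual implementation.

```python
import cmath


def fft_stockham(x):
    """Radix-2 iterative out-of-place FFT (Stockham autosort form).

    Illustrative sketch only: the function name and the choice of the
    Stockham variant are assumptions, not the paper's implementation.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")

    src = [complex(v) for v in x]   # buffer read in the current stage
    dst = [0j] * n                  # buffer written in the current stage
    length = n                      # size of the sub-transforms in this stage
    stride = 1                      # number of interleaved sub-transforms

    while length > 1:
        half = length // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / length)  # twiddle factor
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        # Out-of-place: swap the roles of the buffers instead of
        # overwriting the input; no bit-reversal pass is required.
        src, dst = dst, src
        length = half
        stride *= 2

    return src


def naive_dft(x):
    """O(n^2) reference DFT, used only to check the sketch above."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

Each stage reads from one buffer and writes to the other, so the output ends up in natural order without a separate reordering step. For power-of-two lengths, `fft_stockham(x)` should agree with `naive_dft(x)` up to floating-point rounding.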