Computing discrete transforms on the Cell Broadband Engine

Authors:
David A. Bader;Virat Agarwal;Seunghwa Kang
Affiliations:
Georgia Institute of Technology, College of Computing, Atlanta, GA 30332, USA;Georgia Institute of Technology, College of Computing, Atlanta, GA 30332, USA;Georgia Institute of Technology, College of Computing, Atlanta, GA 30332, USA
Venue:
Parallel Computing
Year:
2009

Abstract

Discrete transforms are fundamental kernels in many computationally intensive scientific applications. In this paper, we investigate the performance of two such algorithms, the Fast Fourier Transform (FFT) and the Discrete Wavelet Transform (DWT), on the Sony-Toshiba-IBM Cell Broadband Engine (Cell/B.E.), a heterogeneous multicore chip architected for intensive gaming applications and high performance computing. We design an efficient parallel implementation of the FFT that fully exploits the architectural features of the Cell/B.E. Our FFT algorithm uses an iterative out-of-place approach and, for 1K to 16K complex input samples, outperforms all other parallel implementations of FFT on the Cell/B.E., including FFTW. Our FFT implementation obtains a single-precision performance of 18.6 GFLOP/s on the Cell/B.E., outperforming Intel Duo Core (Woodcrest) for inputs of greater than 2K samples. We also optimize the DWT in the context of JPEG2000 for the Cell/B.E. The DWT has abundant parallelism; however, due to the low temporal locality of the algorithm, memory bandwidth becomes a significant bottleneck in achieving high performance. We introduce a novel data decomposition scheme to achieve highly efficient DMA data transfer and vectorization with low programming complexity. We also merge multiple steps of the algorithm to reduce the bandwidth requirement, which significantly improves the scalability of the implementation. Our optimized DWT implementation achieves speedups of 34 and 56 times over the baseline code using one Cell/B.E. chip, for the lossless and lossy transforms, respectively. We also compare performance with the AMD Barcelona (Quad-core Opteron) processor, which the Cell/B.E. outperforms. This highlights the advantage of the Cell/B.E. over general purpose multicore processors in processing regular and bandwidth intensive scientific applications.
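To make the phrase "iterative out-of-place approach" concrete, the following is a minimal Python sketch of one well-known algorithm of that kind, a radix-2 Stockham-style FFT that alternates between two buffers instead of permuting data in place. The abstract does not name the exact variant the authors use, so this is an illustrative assumption rather than their implementation; the function name `fft_out_of_place` is hypothetical, and the sketch ignores everything specific to the Cell/B.E. (SIMD, DMA transfers, work distribution across cores).

```python
import cmath


def fft_out_of_place(x):
    """Radix-2 iterative out-of-place FFT (Stockham-style sketch).

    Illustrative only: assumes len(x) is a power of two and makes no
    attempt at vectorization or parallelism.
    """
    total = len(x)
    if total == 0 or total & (total - 1):
        raise ValueError("input length must be a power of two")

    src = [complex(v) for v in x]   # buffer read in the current stage
    dst = [0j] * total              # buffer written in the current stage

    n, stride = total, 1            # sub-transform size and element stride
    while n > 1:
        half = n // 2
        for p in range(half):
            # Twiddle factor for butterfly p of a size-n sub-transform.
            w = cmath.exp(-2j * cmath.pi * p / n)
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                # Results go to a different buffer in sorted order, so no
                # separate bit-reversal pass is needed.
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src         # swap roles of the two buffers
        n, stride = half, 2 * stride
    return src
```

As a quick check, calling `fft_out_of_place([1, 0, 0, 0])` returns four values equal to 1, the expected transform of a unit impulse. Each stage reads one buffer with regular strides and writes the other, which is the general property that makes out-of-place formulations convenient for streaming data through small local memories.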