Discrete transforms are fundamental kernels in many computationally intensive scientific applications. In this paper, we investigate the performance of two such transforms, the Fast Fourier Transform (FFT) and the Discrete Wavelet Transform (DWT), on the Sony-Toshiba-IBM Cell Broadband Engine (Cell/B.E.), a heterogeneous multicore chip architected for intensive gaming applications and high performance computing. We design an efficient parallel implementation of the FFT that fully exploits the architectural features of the Cell/B.E. Our FFT algorithm uses an iterative out-of-place approach and, for 1K to 16K complex input samples, outperforms all other parallel implementations of FFT on the Cell/B.E., including FFTW. Our FFT implementation achieves a single-precision performance of 18.6 GFLOP/s on the Cell/B.E., outperforming the Intel Duo Core (Woodcrest) for inputs of more than 2K samples. We also optimize the DWT in the context of JPEG2000 for the Cell/B.E. The DWT has abundant parallelism; however, because the algorithm has low temporal locality, memory bandwidth becomes a significant bottleneck to high performance. We introduce a novel data decomposition scheme that achieves highly efficient DMA data transfer and vectorization with low programming complexity. We also merge multiple steps of the algorithm to reduce the bandwidth requirement, which significantly improves the scalability of the implementation. On one Cell/B.E. chip, our optimized DWT implementation achieves speedups of 34 and 56 times over the baseline code for the lossless and lossy transforms, respectively. We also compare performance with the AMD Barcelona (Quad-core Opteron) processor, which the Cell/B.E. outperforms. This highlights the advantage of the Cell/B.E. over general-purpose multicore processors in processing regular, bandwidth-intensive scientific applications.
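To make the idea of merging steps concrete, the following is a minimal sketch, not the Cell/B.E. implementation described above. It assumes the steps being merged are the predict and update stages of the reversible 5/3 lifting filter that JPEG2000 uses for its lossless transform; the abstract does not say which steps are merged. The function names `lift53_forward` and `lift53_inverse` are hypothetical. The forward transform computes both stages in a single pass over a 1-D signal, so each input sample is read once instead of once per stage.

```python
def lift53_forward(x):
    """One level of the reversible 5/3 lifting DWT on an even-length list of ints.

    Predict and update are fused into one loop (an illustrative assumption about
    what "merging steps" could look like), with symmetric boundary extension.
    Returns (s, d): low-pass and high-pass coefficients.
    """
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("input length must be even and at least 2")
    half = n // 2
    s = [0] * half
    d = [0] * half
    prev_d = None
    for i in range(half):
        left = x[2 * i]
        # Symmetric extension on the right edge: x[n] mirrors to x[n - 2].
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]
        # Predict: detail = odd sample minus floor-average of even neighbours.
        d[i] = x[2 * i + 1] - ((left + right) >> 1)
        # Symmetric extension on the left edge: d[-1] mirrors to d[0].
        if prev_d is None:
            prev_d = d[i]
        # Update: uses the detail just computed, so no second pass is needed.
        s[i] = left + ((prev_d + d[i] + 2) >> 2)
        prev_d = d[i]
    return s, d


def lift53_inverse(s, d):
    """Invert lift53_forward exactly (integer arithmetic, so it is lossless)."""
    half = len(s)
    n = 2 * half
    x = [0] * n
    # Undo update: recover the even samples.
    for i in range(half):
        prev_d = d[i - 1] if i > 0 else d[0]
        x[2 * i] = s[i] - ((prev_d + d[i] + 2) >> 2)
    # Undo predict: recover the odd samples from the restored even samples.
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]
        x[2 * i + 1] = d[i] + ((x[2 * i] + right) >> 1)
    return x
```

Calling `lift53_inverse(*lift53_forward(signal))` returns the original integer signal. In a bandwidth-bound setting, the point of the fused loop is that the update stage consumes each detail coefficient while it is still in a register or local store, instead of re-reading it from main memory.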