In this work, we analyze the behavior of several parallel algorithms developed to compute the two-dimensional discrete wavelet transform (2D-DWT), using OpenMP on a multicore platform and CUDA on a GPU. The proposed parallel algorithms are based on both regular filter-bank convolution and the lifting transform, with small implementation changes aimed at reducing both memory requirements and computational complexity. We compare our implementations against sequential CPU algorithms and other recently proposed algorithms, such as the SMDWT algorithm on different CPUs and the Wippig&Klauer algorithm on a GTX 280 GPU. Finally, we analyze their behavior when the algorithms are adapted to each architecture. Significant execution-time improvements are achieved on both multicore platforms and GPUs. Depending on the multicore platform used, we achieve speed-ups of 1.9 and 3.4 with two and four processes, respectively, over the sequential CPU algorithm, and speed-ups of 7.1 and 8.9 with eight and ten processes. Regarding GPUs, the GPU convolution algorithm that exploits the GPU's shared memory achieves speed-ups of up to 20 over the sequential CPU algorithm.
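To make the shared-memory convolution idea concrete, the following CUDA kernel is a minimal sketch of one horizontal analysis pass of the 2D-DWT. It is illustrative only, not the authors' implementation: the kernel name, tile size, launch configuration, and the choice of CDF 9/7 analysis filters are assumptions, and the image width is assumed even (padding otherwise).

```cuda
#include <cuda_runtime.h>

#define TILE   128   // low/high-pass coefficients produced per block (assumed)
#define RADIUS 4     // half-length of the longer (9-tap) analysis filter

// CDF 9/7 analysis filters in constant memory (hypothetical choice; the
// abstract does not fix the filter bank).
__constant__ float c_lo[9] = {
     0.026748757411f, -0.016864118443f, -0.078223266529f,
     0.266864118443f,  0.602949018236f,  0.266864118443f,
    -0.078223266529f, -0.016864118443f,  0.026748757411f };
__constant__ float c_hi[7] = {
     0.091271763114f, -0.057543526229f, -0.591271763114f,
     1.115087052457f, -0.591271763114f, -0.057543526229f,
     0.091271763114f };

// One horizontal analysis pass (width assumed even). Each block stages
// 2*TILE input samples plus halos in shared memory, then every thread
// emits one low-pass and one high-pass coefficient (downsampling by 2),
// so each global-memory sample is read only once per pass.
__global__ void dwt_row_pass(const float *in, float *lo, float *hi, int width)
{
    __shared__ float s[2 * TILE + 2 * RADIUS];

    const float *src = in + blockIdx.y * width;  // this block's image row
    int base = blockIdx.x * 2 * TILE;            // first input sample of tile

    // Cooperative load with symmetric (mirrored) boundary extension.
    for (int i = threadIdx.x; i < 2 * TILE + 2 * RADIUS; i += blockDim.x) {
        int g = base + i - RADIUS;
        if (g < 0)      g = -g;                  // mirror left edge
        if (g >= width) g = 2 * width - 2 - g;   // mirror right edge
        s[i] = src[g];
    }
    __syncthreads();

    int outX = blockIdx.x * TILE + threadIdx.x;  // output coefficient index
    if (outX >= width / 2) return;

    int c = 2 * threadIdx.x + RADIUS;            // even-phase input sample
    float a = 0.0f, d = 0.0f;
    for (int k = -4; k <= 4; ++k) a += c_lo[k + 4] * s[c + k];
    for (int k = -3; k <= 3; ++k) d += c_hi[k + 3] * s[c + 1 + k];

    lo[blockIdx.y * (width / 2) + outX] = a;     // approximation subband
    hi[blockIdx.y * (width / 2) + outX] = d;     // detail subband
}

// Possible launch (illustrative):
//   dim3 block(TILE), grid((width / 2 + TILE - 1) / TILE, height);
//   dwt_row_pass<<<grid, block>>>(d_in, d_lo, d_hi, width);
```

The rationale for the shared-memory stage is data reuse: each input sample contributes to up to nine neighboring filter taps, so reading it once into fast on-chip memory instead of repeatedly from global memory is a plausible source of the large GPU speed-ups reported in the abstract. A vertical pass over the columns of the resulting subbands would complete one 2D-DWT level in the same style.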