Reducing 3D Fast Wavelet Transform Execution Time Using Blocking and the Streaming SIMD Extensions

Authors:
Gregorio Bernabé;José M. García;José González
Affiliations:
Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, Murcia, Spain 30071;Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, Murcia, Spain 30071;Intel Barcelona Research Center, Intel Labs, Barcelona, Spain 08034
Venue:
Journal of VLSI Signal Processing Systems
Year:
2005



Abstract

Video compression algorithms based on the 3D wavelet transform achieve excellent compression rates at the expense of huge memory requirements, which drastically affect their execution time. The objective of this work is to enable real-time video compression based on the 3D fast wavelet transform. We show how hardware and software interact for this multimedia application on a general-purpose processor. First, we mitigate the memory problem by exploiting the processor's memory hierarchy with several techniques. For instance, we implement and evaluate the blocking technique, presenting two approaches, cube and rectangular, which differ in the way the original working set is divided. We also propose reusing previous computations to decrease the number of memory accesses and floating-point operations. Afterwards, we present several optimizations that the compiler cannot apply because of the characteristics of the algorithm. On the one hand, the Streaming SIMD Extensions (SSE) are used for some dimensions of the sequence (y and time) to reduce the number of floating-point instructions by exploiting data-level parallelism; we then apply loop unrolling and data prefetching to specific parts of the code. On the other hand, the algorithm is vectorized by columns, allowing the use of SIMD instructions for the y dimension. Results show speedups of 5x in execution time over a version compiled with the maximum optimizations of the Intel C/C++ compiler, while maintaining the compression ratio and the video quality (PSNR) of the original encoder based on the 3D wavelet transform. Our experiments also show that allowing the compiler to perform some of these optimizations (i.e., automatic code vectorization) causes a performance slowdown, demonstrating the effectiveness of our optimizations.
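
The blocking idea summarized in the abstract can be illustrated with a minimal sketch in C. It is not the authors' implementation. It assumes a one-level Haar filter (chosen only because its two-tap support lets even-sized blocks be transformed without overlap), an interleaved in-place layout (low-pass results in even slots, high-pass in odd slots), and a volume stored frame by frame and row by row. All function names (`haar_1d`, `haar_3d_region`, `haar_3d_unblocked`, `haar_3d_blocked`) and the block-size parameters `bt`, `by`, `bx` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define HAAR_SCALE 0.70710678f /* 1/sqrt(2), keeps the Haar step orthonormal */

/* One Haar step over n samples spaced `stride` floats apart, in place.
   Even slots receive the low-pass value, odd slots the high-pass value.
   A trailing unpaired sample (odd n) is left untouched. */
static void haar_1d(float *p, size_t n, size_t stride)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        float a = p[i * stride];
        float b = p[(i + 1) * stride];
        p[i * stride] = (a + b) * HAAR_SCALE;
        p[(i + 1) * stride] = (a - b) * HAAR_SCALE;
    }
}

/* One-level 3D Haar on the region [t0,t0+nt) x [y0,y0+ny) x [x0,x0+nx)
   of a volume with frames of H rows and W columns (assumed layout). */
static void haar_3d_region(float *v, size_t H, size_t W,
                           size_t t0, size_t nt,
                           size_t y0, size_t ny,
                           size_t x0, size_t nx)
{
    const size_t frame = H * W;
    for (size_t t = t0; t < t0 + nt; t++)      /* pass along x */
        for (size_t y = y0; y < y0 + ny; y++)
            haar_1d(v + t * frame + y * W + x0, nx, 1);
    for (size_t t = t0; t < t0 + nt; t++)      /* pass along y */
        for (size_t x = x0; x < x0 + nx; x++)
            haar_1d(v + t * frame + y0 * W + x, ny, W);
    for (size_t y = y0; y < y0 + ny; y++)      /* pass along time */
        for (size_t x = x0; x < x0 + nx; x++)
            haar_1d(v + t0 * frame + y * W + x, nt, frame);
}

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Baseline: three full passes, each sweeping the whole working set. */
void haar_3d_unblocked(float *v, size_t T, size_t H, size_t W)
{
    haar_3d_region(v, H, W, 0, T, 0, H, 0, W);
}

/* Blocked: all three passes finish on one bt x by x bx block before the
   next block is touched, so a block that fits in cache is reused. */
void haar_3d_blocked(float *v, size_t T, size_t H, size_t W,
                     size_t bt, size_t by, size_t bx)
{
    assert(bt > 0 && by > 0 && bx > 0);
    /* Even block sizes keep every Haar pair inside a single block. */
    assert(bt % 2 == 0 && by % 2 == 0 && bx % 2 == 0);
    for (size_t t0 = 0; t0 < T; t0 += bt)
        for (size_t y0 = 0; y0 < H; y0 += by)
            for (size_t x0 = 0; x0 < W; x0 += bx)
                haar_3d_region(v, H, W,
                               t0, min_size(bt, T - t0),
                               y0, min_size(by, H - y0),
                               x0, min_size(bx, W - x0));
}
```

Both functions produce the same coefficients; only the traversal order differs. Equal values of `bt`, `by`, and `bx` give a cube-shaped division of the working set and unequal values a rectangular one, although how these shapes correspond to the two approaches evaluated in the paper is an assumption here. Longer filters would need overlapping block borders, which this sketch does not handle.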