Video compression algorithms based on the 3D wavelet transform obtain excellent compression rates at the expense of huge memory requirements, which drastically affect the execution time of such applications. The objective of this work is to enable real-time video compression based on the 3D fast wavelet transform. We show the hardware and software interaction for this multimedia application on a general-purpose processor. First, we mitigate the memory problem by exploiting the processor's memory hierarchy with several techniques. For instance, we implement and evaluate the blocking technique. We present two blocking approaches in particular, cube and rectangular, which differ in the way the original working set is divided. We also propose reusing previous computations in order to decrease the number of memory accesses and floating-point operations. Afterwards, we present several optimizations that cannot be applied by the compiler due to the characteristics of the algorithm. On the one hand, the Streaming SIMD Extensions (SSE) are used for some of the dimensions of the sequence (y and time) to reduce the number of floating-point instructions by exploiting data-level parallelism. Then, we apply loop unrolling and data prefetching to specific parts of the code. On the other hand, the algorithm is vectorized by columns, allowing the use of SIMD instructions for the y dimension. Results show speedups of 5x in execution time over a version compiled with the maximum optimizations of the Intel C/C++ compiler, while maintaining the compression ratio and the video quality (PSNR) of the original encoder based on the 3D wavelet transform. Our experiments also show that allowing the compiler to perform some of these optimizations (i.e., automatic code vectorization) causes a performance slowdown, demonstrating the effectiveness of our optimizations.
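
To make the blocking idea concrete, the following C sketch contrasts an unblocked and a blocked traversal for one level of a temporal wavelet step. It is only an illustration under stated assumptions, not the encoder described above: the Haar filter is swapped in for brevity, the `video[t][y][x]` layout with x varying fastest is assumed, and the names `idx`, `haar_time_naive`, `haar_time_blocked`, `by`, and `bx` are hypothetical. It also does not reproduce the cube or rectangular schemes specifically.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: frames stored contiguously as video[t][y][x], x fastest. */
static size_t idx(size_t t, size_t y, size_t x, size_t ny, size_t nx)
{
    return (t * ny + y) * nx + x;
}

/* Unblocked: each pixel walks through every frame, so consecutive
 * accesses are a whole frame apart in memory. */
void haar_time_naive(const float *in, float *out,
                     size_t nt, size_t ny, size_t nx)
{
    assert(nt % 2 == 0);
    size_t half = nt / 2;
    for (size_t y = 0; y < ny; ++y)
        for (size_t x = 0; x < nx; ++x)
            for (size_t k = 0; k < half; ++k) {
                float a = in[idx(2 * k, y, x, ny, nx)];
                float b = in[idx(2 * k + 1, y, x, ny, nx)];
                out[idx(k, y, x, ny, nx)] = 0.5f * (a + b);        /* low-pass  */
                out[idx(half + k, y, x, ny, nx)] = 0.5f * (a - b); /* high-pass */
            }
}

/* Blocked: the frame is split into by-by-bx tiles. Within a tile, the
 * innermost loop runs over contiguous x, and the tile is small enough
 * (for suitable by, bx) to stay in cache while all frames are visited. */
void haar_time_blocked(const float *in, float *out,
                       size_t nt, size_t ny, size_t nx,
                       size_t by, size_t bx)
{
    assert(nt % 2 == 0 && by > 0 && bx > 0);
    size_t half = nt / 2;
    for (size_t y0 = 0; y0 < ny; y0 += by) {
        size_t y1 = (y0 + by < ny) ? y0 + by : ny;
        for (size_t x0 = 0; x0 < nx; x0 += bx) {
            size_t x1 = (x0 + bx < nx) ? x0 + bx : nx;
            for (size_t k = 0; k < half; ++k)
                for (size_t y = y0; y < y1; ++y)
                    for (size_t x = x0; x < x1; ++x) {
                        float a = in[idx(2 * k, y, x, ny, nx)];
                        float b = in[idx(2 * k + 1, y, x, ny, nx)];
                        out[idx(k, y, x, ny, nx)] = 0.5f * (a + b);
                        out[idx(half + k, y, x, ny, nx)] = 0.5f * (a - b);
                    }
        }
    }
}
```

Both functions compute the same coefficients; only the traversal order differs. Keeping the contiguous x loop innermost is also the shape that lends itself to SIMD instructions, since adjacent pixels can be loaded into one vector register. Suitable tile sizes depend on the cache of the target processor and would need to be measured.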