Modern GPUs excel at parallel computation, making them an attractive target for matrix transformations such as the DCT, a fundamental part of MPEG video coding. For a system that encodes synthetic video (e.g., computer-generated frames), this approach is even more appealing, since the images to encode already reside in GPU memory, eliminating the cost of transferring raw video from the CPU to the GPU. However, after a raw frame has been transformed and quantized on the GPU, the resulting coefficients must be reordered, entropy encoded, and framed into the output MPEG bitstream. These last steps are essentially sequential, and a straightforward GPU implementation of them is inefficient compared to CPU-based implementations. We present different approaches to implementing part of these steps on the GPU, aiming for better usage of the memory bus and compensating for the suboptimal use of the GPU with gains in transfer time. We analyze three approaches that combine GPU and CPU to perform the zigzag scan and Huffman coding, and two approaches to assembling the results into the actual output bitstream, in GPU and in CPU memory respectively. Our experiments show that reducing the amount of data transferred from GPU to CPU by implementing the last sequential compression steps on the GPU, together with a parallel fast-scan implementation of the zigzag reordering, improves the overall performance of the system: the savings in transfer time outweigh the extra cost incurred on the GPU.
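To make the reordering step concrete, the following is a minimal Python sketch (purely illustrative, not the paper's GPU implementation) of the zigzag scan of an n×n coefficient block: coefficients are visited along anti-diagonals of alternating direction, which places low-frequency coefficients first and groups trailing zeros for entropy coding.

```python
def zigzag_order(n=8):
    """Return the (row, col) visiting order for an n-by-n zigzag scan."""
    order = []
    for s in range(2 * n - 1):
        # Cells on anti-diagonal s satisfy row + col == s.
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        # Odd diagonals run top-right to bottom-left, even ones the reverse.
        order.extend(diag if s % 2 == 1 else reversed(diag))
    return order

def zigzag_scan(block):
    """Flatten a square block of coefficients into zigzag order."""
    n = len(block)
    return [block[i][j] for (i, j) in zigzag_order(n)]
```

On the GPU, this data-independent permutation is what the paper's parallel fast-scan variant reorders without the sequential loop shown here.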