A Parallel H.264 Encoder with CUDA: Mapping and Evaluation
ICPADS '12 Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems
Recently, the computational power of the Graphics Processing Unit (GPU) has increased dramatically, yet previous GPU implementations of intra prediction could not efficiently exploit this massive parallelism: related work achieves only frame-level, slice-level, or block-level parallelism. Implementing fine-grained parallelism, such as pixel-level and mode-level parallelism, on the Compute Unified Device Architecture (CUDA) is challenging, because the irregular formulas of intra prediction and the constraints imposed by H.264/AVC introduce many branch instructions, which the CUDA architecture is inherently poor at handling. This paper presents a CUDA-based approach that adopts fine-grained parallelism. By transforming the various prediction formulas into a common form and introducing the predictor unit, a lookup-table-based algorithm is proposed that efficiently eliminates the branches. In addition, a combinatorial frame technique and an optimized encoding order are adopted to maximize parallelism. Experimental results show that the proposed algorithm achieves a significant reduction in encoding time and outperforms previous works.