Fine-grained CUDA-based Parallel Intra Prediction for H.264/AVC

  • Authors:
  • Wenbin Jiang, Min Long, Hai Jin, Pengcheng Wang

  • Affiliations:
  • Services Computing Technology and System Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China

  • Venue:
  • Proceedings of the Network and Operating System Support on Digital Audio and Video Workshop (NOSSDAV)
  • Year:
  • 2014

Abstract

Recently, the computational power of the Graphics Processing Unit (GPU) has increased greatly, yet previous implementations of intra prediction on the GPU could not efficiently exploit this massive parallelism: related work achieves only frame-level, slice-level, or block-level parallelism. Implementing fine-grained parallelism, such as pixel-level and mode-level, on the Compute Unified Device Architecture (CUDA) is challenging because the irregular prediction formulas and the constraints imposed by H.264/AVC introduce many branch instructions, and the CUDA architecture is inherently poor at handling branches. In this paper, a CUDA-based approach that adopts fine-grained parallelism is presented. By transforming the various prediction formulas into a common form and introducing the predictor unit, a lookup-table-based algorithm is proposed that efficiently eliminates these branches. In addition, the combinatorial frame technique and an optimized encoding order are adopted to maximize parallelism. Experimental results show that the encoding time is reduced significantly and that the proposed algorithm outperforms previous works.
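The abstract's core idea, casting every intra-prediction mode as one uniform weighted-sum formula driven by a lookup table so that no per-mode branching remains, can be sketched as follows. This is a minimal host-side C illustration, not the paper's actual implementation: it covers only three of the nine 4x4 luma modes (Vertical, Horizontal, DC), and the table layout, names (`W`, `predict_pixel`), and fixed-point form `pred = (sum_k W[mode][pix][k] * ref[k] + round) >> shift` are assumptions for illustration. In a CUDA kernel, the branch-free body of `predict_pixel` would map naturally onto one thread per (pixel, mode) pair.

```c
#include <stdint.h>
#include <string.h>

#define N_MODES 3   /* illustrative subset: 0 = Vertical, 1 = Horizontal, 2 = DC */
#define N_PIX   16  /* pixels of a 4x4 block, row-major: pix = y*4 + x           */
#define N_REF   8   /* ref[0..3] = top neighbours A..D, ref[4..7] = left I..L    */

/* Weight lookup table: W[mode][pix][k] is the (hypothetical) integer weight of
 * reference sample k in the prediction of pixel pix under the given mode.     */
static uint8_t W[N_MODES][N_PIX][N_REF];

/* Per-mode rounding term and right-shift complete the single shared formula:
 *   pred = (sum_k W[mode][pix][k] * ref[k] + round_term[mode]) >> shift_amt[mode] */
static const int round_term[N_MODES] = { 0, 0, 4 };
static const int shift_amt[N_MODES]  = { 0, 0, 3 };

/* Fill the table once up front; afterwards the per-pixel work is branch-free. */
void init_weight_table(void)
{
    memset(W, 0, sizeof W);
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++) {
            int p = y * 4 + x;
            W[0][p][x]     = 1;      /* Vertical: copy top neighbour of same column */
            W[1][p][4 + y] = 1;      /* Horizontal: copy left neighbour of same row */
            for (int k = 0; k < N_REF; k++)
                W[2][p][k] = 1;      /* DC: average of all eight neighbours         */
        }
}

/* One uniform formula for every mode and pixel: no switch over prediction
 * modes, hence no thread divergence when each thread runs this body.        */
uint8_t predict_pixel(int mode, int pix, const uint8_t ref[N_REF])
{
    int sum = round_term[mode];
    for (int k = 0; k < N_REF; k++)
        sum += W[mode][pix][k] * ref[k];
    return (uint8_t)(sum >> shift_amt[mode]);
}
```

The design point being illustrated: the mode-dependent control flow is moved into table *data* computed once on the host, so all threads execute identical instructions and only their operands differ, which is exactly the shape CUDA warps execute efficiently.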