A QHD-capable parallel H.264 decoder

Authors:
Chi Ching Chi; Ben Juurlink
Affiliations:
Technische Universität Berlin, Berlin, Germany; Technische Universität Berlin, Berlin, Germany
Venue:
Proceedings of the international conference on Supercomputing
Year:
2011


Abstract

Each new generation of video coding demands higher performance, so video coding is a natural candidate for many-core processors. A complete parallelization of H.264, the most advanced video coding standard, has proven difficult because of the standard's complexity. This paper presents a parallel implementation of a complete H.264 decoder. The parallelization strategy exploits both function-level and data-level parallelism. Function-level parallelism is used to pipeline the H.264 decoding stages. Data-level parallelism is exploited within the two most time-consuming stages: entropy decoding and macroblock decoding. The strategy has been implemented and optimized on three platforms with very different memory architectures: an 8-core SMP, a 64-core cc-NUMA, and an 18-core Cell platform. Evaluations were performed using 4kx2k QHD sequences. On the SMP platform, a maximum speedup of 4.5x is achieved. The SMP implementation is reasonably performance-portable, achieving a speedup of 26.6x on the cc-NUMA system. However, obtaining the highest performance (a speedup of 33.4x and a throughput of 200 QHD frames per second) requires several cc-NUMA-specific optimizations, such as optimizing page placement and statically assigning threads to cores. Finally, on the Cell platform, a near-ideal speedup of 16.5x is achieved by completely hiding the communication latency.
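
The combination of function-level and data-level parallelism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stage functions `entropy_decode` and `decode_macroblocks` are hypothetical placeholders that do trivial arithmetic, a "frame" is simply a list of integers, and the units within a frame are assumed to be independent, whereas real H.264 macroblocks depend on their neighbors. The sketch only shows the structure: two pipeline stages running concurrently and connected by queues, each processing the units of one frame in parallel.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

SENTINEL = None  # marks the end of the frame stream


def entropy_decode(unit):
    # Hypothetical placeholder for entropy decoding one independent unit.
    return unit * 2


def decode_macroblocks(unit):
    # Hypothetical placeholder for macroblock reconstruction of one unit.
    return unit + 1


def stage(work, inbox, outbox, workers):
    """Run one pipeline stage until the sentinel arrives."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            frame = inbox.get()
            if frame is SENTINEL:
                outbox.put(SENTINEL)  # propagate end-of-stream downstream
                return
            # Data-level parallelism: units of one frame are processed concurrently.
            outbox.put(list(pool.map(work, frame)))


def run_pipeline(frames, workers=4):
    q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    # Function-level parallelism: each stage runs in its own thread, so stage 2
    # can work on frame n while stage 1 already handles frame n + 1.
    threads = [
        threading.Thread(target=stage, args=(entropy_decode, q_in, q_mid, workers)),
        threading.Thread(target=stage, args=(decode_macroblocks, q_mid, q_out, workers)),
    ]
    for t in threads:
        t.start()
    for frame in frames:
        q_in.put(frame)
    q_in.put(SENTINEL)

    results = []
    while True:
        item = q_out.get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3]]))
```

Because each stage has a single coordinating thread and the queues are FIFO, frames leave the pipeline in the order they entered. CPython threads do not speed up CPU-bound work, so this shows the structure of the approach rather than its performance.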