Video coding demands higher performance with every new generation and is therefore a natural candidate for many-core processors. A complete parallelization of H.264, which is the most advanced video coding standard, has proven difficult because of the complexity of the standard. This paper presents a parallel implementation of a complete H.264 decoder. Our parallelization strategy exploits both function-level and data-level parallelism. Function-level parallelism is used to pipeline the H.264 decoding stages. Data-level parallelism is exploited within the two most time-consuming stages: entropy decoding and macroblock decoding. The strategy has been implemented and optimized on three platforms with very different memory architectures: an 8-core SMP, a 64-core cc-NUMA, and an 18-core Cell platform. Evaluations were performed using 4kx2k QHD sequences. On the SMP platform, a maximum speedup of 4.5x is achieved. The SMP implementation is reasonably performance-portable, as it achieves a speedup of 26.6x on the cc-NUMA system. However, obtaining the highest performance (a speedup of 33.4x and a throughput of 200 QHD frames per second) requires several cc-NUMA-specific optimizations, such as optimizing page placement and statically assigning threads to cores. Finally, on the Cell platform, a near-ideal speedup of 16.5x is achieved by completely hiding the communication latency.
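As a rough way to compare the reported speedups across platforms, each one can be divided by the core count of its platform to give a parallel efficiency. The sketch below does only that arithmetic. The speedups and core counts are taken from the abstract; dividing by the total core count of each platform is an assumption, since the abstract does not state how many cores were active for each measurement, and the names `RESULTS`, `parallel_efficiency`, and `efficiency_table` are illustrative only.

```python
# Speedups and core counts as quoted in the abstract.
# ASSUMPTION: each speedup is related to the platform's total core count;
# the abstract does not say how many cores were used per measurement.
RESULTS = [
    ("SMP", 8, 4.5),
    ("cc-NUMA (ported SMP code)", 64, 26.6),
    ("cc-NUMA (tuned)", 64, 33.4),
    ("Cell", 18, 16.5),
]


def parallel_efficiency(speedup, cores):
    """Return speedup per core, i.e. the fraction of ideal linear scaling."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


def efficiency_table(results):
    """Map each configuration name to its parallel efficiency."""
    return {name: parallel_efficiency(speedup, cores)
            for name, cores, speedup in results}


if __name__ == "__main__":
    for name, eff in efficiency_table(RESULTS).items():
        print(f"{name:28s} {eff:.3f}")
```

Under this assumption, the cc-NUMA-specific tuning raises efficiency on the 64-core system from roughly 0.42 to roughly 0.52, while the Cell figure corresponds to roughly 0.92.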