EURASIP Journal on Applied Signal Processing
MPEG-4 is the latest multimedia coding standard; it supports object-based coding and manipulation of natural video and synthetic graphics objects. Because of its rich feature set and high coding efficiency, MPEG-4 is becoming popular in video streaming applications. Many graphics coprocessors accelerate the inverse discrete cosine transform (IDCT) and motion compensation for real-time video decoding, so it is desirable to use them to accelerate MPEG-4 video decoding as well. Since MPEG-4 decoding of rectangular video objects is similar to decoding in other video coding standards, e.g., MPEG-2, the IDCT and motion compensation can still be executed on the graphics coprocessor. However, we have found that boundary macroblock padding, an essential step in decoding arbitrarily shaped video objects, cannot be accelerated efficiently on graphics coprocessors because of its complexity. The padding can instead be implemented on the host processor, but then the frame data processed on the graphics coprocessor must be transferred to the host for padding, and the padded data must be sent back to the graphics coprocessor to serve as a reference for subsequent frames. To avoid this overhead, we present two approaches to boundary macroblock padding. In the first approach, the padding is partitioned into two tasks, one of which the host processor can perform without the overhead of data transfers. In the second approach, we propose two new instructions and an algorithm that can be easily adopted in next-generation graphics coprocessors or mediaprocessors, giving a performance improvement of up to a factor of nine over the Pentium III.
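For readers unfamiliar with the padding step, the following is a minimal Python sketch of repetitive padding for a single boundary macroblock, as it is commonly described for MPEG-4 Visual: a horizontal pass followed by a vertical pass. It is not the two-task partition or the proposed instructions from this paper. The function names `pad_line` and `repetitive_pad`, the list-of-lists block layout, and the floor-rounded average are assumptions made for illustration.

```python
def pad_line(values, defined):
    """Pad one row or column (illustrative helper, not from the paper).

    Undefined samples between two defined samples take the average of
    their nearest defined neighbours; samples with a defined neighbour
    on one side only copy that neighbour.
    Returns (padded_values, line_had_any_defined_sample).
    """
    idx = [i for i in range(len(values)) if defined[i]]
    if not idx:
        return list(values), False
    out = list(values)
    for i in range(len(values)):
        if defined[i]:
            continue
        left = max((j for j in idx if j < i), default=None)
        right = min((j for j in idx if j > i), default=None)
        if left is not None and right is not None:
            # Assumption: integer average rounded down.
            out[i] = (values[left] + values[right]) // 2
        elif left is not None:
            out[i] = values[left]
        else:
            out[i] = values[right]
    return out, True


def repetitive_pad(block, mask):
    """Pad a boundary macroblock given its binary shape mask.

    block: 2-D list of sample values; mask: 2-D list, truthy where the
    sample belongs to the video object.
    """
    rows, cols = len(block), len(block[0])
    out = [list(r) for r in block]
    row_filled = [False] * rows

    # Horizontal pass: fill every row that contains object samples.
    for y in range(rows):
        out[y], row_filled[y] = pad_line(block[y], mask[y])

    # Vertical pass: fill the remaining rows column by column, treating
    # horizontally padded rows as defined.
    for x in range(cols):
        col = [out[y][x] for y in range(rows)]
        new_col, _ = pad_line(col, row_filled)
        for y in range(rows):
            out[y][x] = new_col[y]
    return out


if __name__ == "__main__":
    block = [[0, 0, 0, 0],
             [0, 10, 0, 30],
             [0, 0, 0, 0],
             [0, 0, 50, 0]]
    mask = [[0, 0, 0, 0],
            [0, 1, 0, 1],
            [0, 0, 0, 0],
            [0, 0, 1, 0]]
    for row in repetitive_pad(block, mask):
        print(row)
```

The data-dependent branching on the shape mask in each pass suggests why this step maps poorly onto coprocessors built around fixed IDCT and motion compensation pipelines.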