A Five-Stage Pipeline, 204 Cycles/MB, Single-Port SRAM-Based Deblocking Filter for H.264/AVC

Authors:
Ke Xu; Chiu-Sing Choy
Affiliations:
Chinese Univ. of Hong Kong, Hong Kong
Venue:
IEEE Transactions on Circuits and Systems for Video Technology
Year:
2008

Abstract

This paper describes the design and VLSI implementation of a highly efficient, single-port SRAM-based deblocking filter that achieves a throughput of 204 cycles per macroblock for real-time H.264/AVC decoding. Several deblocking filter designs from the literature are compared, and the feasibility of implementing them as pipelines is studied. Based on this study, we propose a new design built around a five-stage pipeline with clock gating, which increases system throughput while reducing power. Data hazards and structural hazards, the two most critical issues for a pipelined filter, are analyzed and resolved. An efficient memory organization is adopted for both the on-chip SRAM and the transposition buffers. By using a hybrid edge-filtering sequence and an out-of-order memory update scheme, the pipeline incurs zero stall cycles in normal flow, making full use of the pipelined architecture. Compared with existing designs, ours requires at least 18% fewer clock cycles and consumes 20% less power, owing to its efficient pipeline and memory architecture. Its total gate count is comparable to those of other designs in the literature, and it does not require any expensive two-port or dual-port on-chip SRAM.
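The 204 cycles/MB throughput translates directly into the minimum clock frequency the filter needs to keep pace with real-time decoding. The sketch below does that arithmetic under stated assumptions: the filter sustains exactly 204 cycles per 16×16 macroblock (the zero-stall normal flow described above), frame-level overhead and pipeline fill are ignored, and frame dimensions are rounded up to whole macroblocks. The helper names and the example resolutions and frame rates are hypothetical illustrations, not figures taken from the paper.

```python
import math

# Throughput figure stated in the abstract (cycles per 16x16 macroblock).
CYCLES_PER_MB = 204


def macroblocks_per_frame(width: int, height: int) -> int:
    """Count 16x16 macroblocks covering a frame, rounding each dimension up."""
    return math.ceil(width / 16) * math.ceil(height / 16)


def min_clock_hz(width: int, height: int, fps: float,
                 cycles_per_mb: int = CYCLES_PER_MB) -> float:
    """Minimum clock rate for the filter alone to process every macroblock
    in real time, assuming no idle cycles and no per-frame overhead."""
    return macroblocks_per_frame(width, height) * fps * cycles_per_mb


if __name__ == "__main__":
    # Hypothetical example formats, chosen only for illustration.
    for name, w, h, fps in [("720p30", 1280, 720, 30),
                            ("1080p30", 1920, 1080, 30),
                            ("1080p60", 1920, 1080, 60)]:
        f = min_clock_hz(w, h, fps)
        print(f"{name}: {macroblocks_per_frame(w, h)} MBs/frame, >= {f / 1e6:.1f} MHz")
```

For 1080p at 30 frames/s, the 1080-line height rounds up to 68 macroblock rows, giving 8160 macroblocks per frame and a lower bound of about 49.9 MHz; doubling the frame rate doubles the required clock. In practice a real decoder would need some margin above this bound.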