Efficient orchestration of sub-word parallelism in media processors

Authors:
John Oliver; Venkatesh Akella; Frederic Chong
Affiliations:
University of California at Davis, CA; University of California at Davis, CA; University of California at Davis, CA
Venue:
Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
Year:
2004

Abstract

Communication and multimedia applications with increased data rates and enhanced functionality continuously raise the bar for the computational requirements of future microprocessors. In order to meet these computational demands, it is necessary to exploit sub-word parallelism efficiently. We propose to make sub-word data movement a first-class operation in microprocessor architectures by introducing a Sub-word Permutation Unit (SPU) in the execution pipeline. The SPU is evaluated in the context of the MMX media co-processor for the Intel Pentium architectures, but our results can be extended to any processor that supports sub-word parallelism. We find that the SPU allows us to orchestrate sub-word data placement prior to computation, thus allowing the MMX functional units to concentrate on performing calculations. Furthermore, we introduce a decoupled SPU control mechanism at the basic block level which allows static optimization to eliminate data-movement overhead in tight loops, where most media and signal processing occurs. We demonstrated that anywhere from 4% to 20% improvement can be obtained on key media and signal processing kernels with as little as 1% increase in hardware resources.
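
To make the idea of separating sub-word placement from computation concrete, the following is a minimal software sketch in C. It is not the SPU design from the paper: the function names `permute16` and `padd16`, the four-lane 16-bit layout, and the selector encoding are all assumptions chosen for illustration. The first function only rearranges sub-words; the second only computes on lanes that are already in place.

```c
#include <stdint.h>

/* Hypothetical model: a 64-bit word holds four 16-bit sub-words (lanes 0..3,
   lane 0 in the least significant bits). sel[i] names the source lane that
   is copied into destination lane i. This models data movement only. */
static uint64_t permute16(uint64_t src, const uint8_t sel[4]) {
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t lane = (src >> (16 * (sel[i] & 3))) & 0xFFFFu;
        dst |= lane << (16 * i);
    }
    return dst;
}

/* Lane-wise wrapping add of four 16-bit sub-words, similar in spirit to a
   packed add. It assumes its operands are already arranged lane by lane. */
static uint64_t padd16(uint64_t a, uint64_t b) {
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(a >> (16 * i));
        uint16_t y = (uint16_t)(b >> (16 * i));
        dst |= (uint64_t)(uint16_t)(x + y) << (16 * i);
    }
    return dst;
}
```

For example, with lanes holding 1, 2, 3, 4 (the word `0x0004000300020001`), the selector `{3, 2, 1, 0}` reverses the lanes, and adding the reversed word to the original yields 5 in every lane. In this sketch the permutation is a separate step that could, in principle, be scheduled ahead of the arithmetic, which is the kind of decoupling the abstract describes.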