Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements

Authors:
Deepu Talla;Lizy Kurian John;Doug Burger
Venue:
IEEE Transactions on Computers
Year:
2003

Abstract

Multimedia SIMD extensions such as MMX and AltiVec speed up media processing; however, our characterization shows that current general-purpose processors enhanced with SIMD extensions are poorly matched to the access patterns and loop structures of media programs. We find that 75 to 85 percent of the dynamic instructions in the processor instruction stream are supporting instructions needed to feed the SIMD execution units rather than true (useful) computations. As a result, the SIMD execution units are underutilized: only 1 to 12 percent of their peak throughput is achieved. Rather than exploiting more data-level parallelism (DLP), in this paper we focus on the instructions that support the SIMD computations and exploit both fine- and coarse-grained instruction-level parallelism (ILP) in the supporting instruction stream. We propose the MediaBreeze architecture, which uses hardware support for efficient address generation, looping, and data reorganization (permute, packing/unpacking, transpose, etc.). Our results on multimedia kernels show that a 2-way processor with SIMD extensions enhanced with MediaBreeze outperforms a 16-way processor with current SIMD extensions. On application benchmarks, a 2-/4-way processor with SIMD extensions augmented with MediaBreeze outperforms a 4-/8-way processor with SIMD extensions. A first-order approximation using ASIC synthesis tools and cell-based libraries shows that this acceleration costs a 10 percent increase over the area required by MMX and SSE extensions (a 0.3 percent increase in overall chip area) and 1 percent of total processor power consumption.
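As a rough illustration of the abstract's distinction between supporting instructions and true computation, the Python sketch below tallies a made-up instruction mix for one iteration of a SIMD inner loop. The categories mirror those named in the abstract (address generation, looping, data reorganization); the instruction names, the counts, and the `supporting_fraction` helper are hypothetical and are not taken from the paper's measurements.

```python
from collections import Counter

# Hypothetical instruction mix for ONE iteration of a SIMD inner loop.
# Each entry is (category, mnemonic). The counts are invented for
# illustration only; they are not data from the paper.
LOOP_BODY = [
    ("address_generation", "add"),       # advance source pointer A
    ("address_generation", "add"),       # advance source pointer B
    ("load", "simd_load"),               # fetch packed operands from A
    ("load", "simd_load"),               # fetch packed operands from B
    ("data_reorganization", "unpack"),   # widen packed elements
    ("data_reorganization", "permute"),  # align lanes for the multiply
    ("compute", "simd_mul"),             # true computation
    ("compute", "simd_add"),             # true computation
    ("loop_control", "cmp"),             # test loop bound
    ("loop_control", "branch"),          # jump back to loop head
]


def supporting_fraction(body):
    """Return the share of instructions that are not 'compute'."""
    counts = Counter(category for category, _ in body)
    total = sum(counts.values())
    useful = counts["compute"]
    return (total - useful) / total


if __name__ == "__main__":
    print(f"supporting instructions: {supporting_fraction(LOOP_BODY):.0%}")
```

In this toy mix only two of the ten instructions do arithmetic on media data. The rest exist to locate, fetch, and rearrange operands or to steer the loop, which is the overhead the abstract says MediaBreeze targets with dedicated hardware.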