Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP

Authors:
Libo Huang;Nong Xiao;Zhiying Wang;Yongwen Wang;Mingche Lai
Venue:
Parallel Computing
Year:
2013

Abstract

Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code that mix data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor integrates single-instruction multiple-data (SIMD) engines into architectures that exploit ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale as the vector length is increased to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions limits how well current SIMD engines handle memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generation, and loop branches, are required to aid the execution of the SIMD instructions that perform the real computation. To mitigate these problems, we propose optimized SIMD engines that combine VLIW or TTA processing with unified scalar and long-vector computation, as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called the multimedia coprocessor (MCP). It includes the following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long-vector capabilities built upon existing SIMD hardware, with a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP can outperform conventional SIMD techniques by an average of 39% for multimedia kernels and 12% for multimedia applications.
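
To make the "supporting instructions" mentioned in the abstract concrete, the following is a minimal Python sketch of two overheads that conventional register-to-register SIMD code commonly incurs: emulating an unaligned vector load with two aligned loads plus a permute, and replacing a per-element branch with a mask and a select. It is not taken from the MCP design. The vector width `W` and all function names are assumptions chosen for illustration, and Python lists stand in for vector registers.

```python
W = 4  # assumed SIMD width in elements (hypothetical, not from the paper)

def aligned_load(mem, addr):
    """Load the W-element block that contains addr (aligned access only)."""
    base = (addr // W) * W              # address generation: round down
    return mem[base:base + W]

def unaligned_load(mem, addr):
    """Emulate an unaligned load: two aligned loads followed by a permute.

    Assumes len(mem) is a multiple of W and addr + W <= len(mem).
    """
    lo = aligned_load(mem, addr)            # block containing the first element
    hi = aligned_load(mem, addr + W - 1)    # block containing the last element
    shift = addr % W                        # offset inside the aligned block
    return (lo + hi)[shift:shift + W]       # permute: pick W elements across both

def simd_select(mask, a, b):
    """Per-lane select: take a[i] where mask[i] is true, otherwise b[i]."""
    return [x if m else y for m, x, y in zip(mask, a, b)]

def threshold_sub(vec, t):
    """Compute (x - t if x > t else 0) in every lane without a per-lane branch."""
    mask = [x > t for x in vec]             # compare produces a lane mask
    diff = [x - t for x in vec]             # computed for all lanes, even unused ones
    return simd_select(mask, diff, [0] * len(vec))

if __name__ == "__main__":
    mem = list(range(16))
    print(unaligned_load(mem, 5))
    print(threshold_sub([1, 5, 3, 9], 3))
```

In this sketch a single useful unaligned load costs two loads, one permute, and the associated address arithmetic, and the conditional costs a compare and a select in addition to the subtraction. This is the kind of overhead the abstract attributes to current SIMD engines.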