Multimedia applications have become increasingly important in daily computing. These applications consist of heterogeneous regions of code that mix data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor integrates single-instruction multiple-data (SIMD) engines into architectures that exploit ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions do not scale with the increased vector length needed to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions limits how well current SIMD engines handle memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generation, and loop branches, are required to assist the SIMD instructions that perform the actual computation. To mitigate these problems, we propose optimized SIMD engines that combine VLIW or TTA processing with unified scalar and long-vector computation, together with efficient SIMD hardware for the actual computation. Our new architecture is based on TTA and is called the multimedia coprocessor (MCP). The architecture includes the following features: (1) a simple coprocessor structure with an 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long-vector capabilities built upon existing SIMD hardware, with a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP outperforms conventional SIMD techniques by an average of 39% for multimedia kernels and 12% for multimedia applications.
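The cost of supporting instructions can be illustrated with a toy counting model. The sketch below is not the paper's evaluation method. It assumes that every vector instruction needs a fixed number of supporting instructions (the hypothetical parameter `support_per_instr`, covering for example one address generation, one permutation, and one loop branch) and that one vector instruction covers `elements_per_instr` elements. The function name and all numbers are illustrative assumptions.

```python
import math


def instruction_counts(n_elements, elements_per_instr, support_per_instr=3):
    """Toy model (an assumption, not taken from the paper).

    Returns (compute, support): the number of vector computation
    instructions and the number of supporting instructions needed
    to process n_elements.
    """
    # One computation instruction per group of elements_per_instr elements.
    compute = math.ceil(n_elements / elements_per_instr)
    # Each computation instruction drags along a fixed supporting overhead.
    support = compute * support_per_instr
    return compute, support


# Short SIMD words: 4 elements per instruction.
short_compute, short_support = instruction_counts(1024, 4)

# Long vectors: 64 elements per instruction (hypothetical length).
long_compute, long_support = instruction_counts(1024, 64)
```

Under these assumptions, the total number of supporting instructions falls in proportion to the number of elements each instruction covers, which is the intuition behind building long-vector capability on top of existing SIMD hardware. Real overheads depend on alignment, data layout, and control flow, which this model ignores.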