Multimedia SIMD extensions such as MMX and AltiVec speed up media processing; however, our characterization shows that current general-purpose processors enhanced with SIMD extensions are poorly matched to the access patterns and loop structures of media programs. We find that 75 to 85 percent of the dynamic instructions in the processor instruction stream are supporting instructions needed to feed the SIMD execution units rather than true/useful computations, so the SIMD execution units are underutilized (only 1 to 12 percent of their peak throughput is achieved). Rather than focusing on exploiting more data-level parallelism (DLP), in this paper we focus on the instructions that support the SIMD computations and exploit both fine- and coarse-grained instruction-level parallelism (ILP) in the supporting instruction stream. We propose the MediaBreeze architecture, which uses hardware support for efficient address generation, looping, and data reorganization (permute, packing/unpacking, transpose, etc.). Our results on multimedia kernels show that a 2-way processor with SIMD extensions enhanced with MediaBreeze outperforms a 16-way processor with current SIMD extensions. On application benchmarks, a 2-/4-way processor with SIMD extensions augmented with MediaBreeze outperforms a 4-/8-way processor with SIMD extensions. A first-order approximation using ASIC synthesis tools and cell-based libraries shows that this acceleration costs a 10 percent increase in the area required by MMX and SSE extensions (a 0.3 percent increase in overall chip area) and 1 percent of total processor power consumption.
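To make the distinction between supporting and useful instructions concrete, the following is a minimal sketch that classifies the instructions of one SIMD inner-loop iteration and reports the supporting share. The instruction mix in `LOOP_BODY`, the category labels, and the function name `supporting_fraction` are all hypothetical and chosen only for illustration; they are not measurements or definitions from the paper.

```python
from collections import Counter

# Hypothetical per-iteration instruction mix for a SIMD inner loop.
# The entries are illustrative assumptions, not data from the paper.
LOOP_BODY = [
    ("load", "support"),          # load packed operand A
    ("load", "support"),          # load packed operand B
    ("unpack", "support"),        # widen narrow data into wider lanes
    ("simd_mul", "compute"),      # useful SIMD work
    ("simd_add", "compute"),      # useful SIMD work
    ("pack", "support"),          # narrow the results back
    ("store", "support"),         # write the packed result
    ("add_addr", "support"),      # advance the address pointers
    ("cmp_branch", "support"),    # loop test and branch
]


def supporting_fraction(body, iterations):
    """Return the share of dynamic instructions tagged as supporting."""
    counts = Counter()
    for _ in range(iterations):
        for _name, kind in body:
            counts[kind] += 1
    total = sum(counts.values())
    return counts["support"] / total


frac = supporting_fraction(LOOP_BODY, iterations=1000)
print(f"supporting instructions: {frac:.1%}")
```

In this toy mix, seven of the nine instructions per iteration are address generation, loop control, or data reorganization. These are the categories the abstract names as targets for hardware support.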