Overcoming the limitations of conventional vector processors

Authors: Christos Kozyrakis; David Patterson
Affiliations: Stanford University; University of California at Berkeley
Venue: Proceedings of the 30th Annual International Symposium on Computer Architecture
Year: 2003

Abstract

Despite their superior performance for multimedia applications, vector processors have three limitations that hinder their widespread acceptance. First, the complexity and size of the centralized vector register file limit the number of functional units. Second, precise exceptions for vector instructions are difficult to implement. Third, vector processors require an expensive on-chip memory system that supports high bandwidth at low access latency.

This paper introduces CODE, a scalable vector microarchitecture that addresses these three shortcomings. It is designed around a clustered vector register file and uses a separate network for operand transfers across functional units. With extensive use of decoupling, it can hide the latency of communication across functional units and provides a 26% performance improvement over a centralized organization. CODE scales efficiently to 8 functional units without requiring wide instruction issue capabilities. A renaming table makes the clustered register file transparent at the instruction set level. Renaming also enables precise exceptions for vector instructions at a performance loss of less than 5%. Finally, decoupling allows CODE to tolerate large increases in memory latency at sub-linear performance degradation without using on-chip caches. Thus, CODE can use economical off-chip memory systems.
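The abstract does not describe how the renaming table is organized, so the following is only a toy sketch of the general idea: a table maps each architectural vector register to a (cluster, physical register) pair, and a history of overwritten mappings allows state to be restored when an exception occurs. The class `RenameTable`, its fields, and the rollback policy are illustrative assumptions, not CODE's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class RenameTable:
    """Toy model (hypothetical): architectural register -> (cluster, physical)."""
    num_arch_regs: int
    num_clusters: int
    regs_per_cluster: int
    mapping: dict = field(default_factory=dict)
    free: list = field(default_factory=list)
    history: list = field(default_factory=list)  # (arch, old location), program order

    def __post_init__(self):
        # Every physical register in every cluster starts out free.
        self.free = [(c, p) for c in range(self.num_clusters)
                     for p in range(self.regs_per_cluster)]
        # Give each architectural register an initial physical home.
        for arch in range(self.num_arch_regs):
            self.mapping[arch] = self.free.pop(0)

    def rename_dest(self, arch, cluster):
        """Allocate a fresh register in `cluster` for a write to `arch`."""
        for i, (c, _) in enumerate(self.free):
            if c == cluster:
                new_loc = self.free.pop(i)
                break
        else:
            raise RuntimeError("no free register in cluster %d" % cluster)
        # Remember the old mapping so the write can be undone later.
        self.history.append((arch, self.mapping[arch]))
        self.mapping[arch] = new_loc
        return new_loc

    def commit_oldest(self):
        """Oldest write is now safe: its previous register can be reused."""
        _, old_loc = self.history.pop(0)
        self.free.append(old_loc)

    def rollback(self):
        """On an exception, undo all uncommitted renames, youngest first."""
        while self.history:
            arch, old_loc = self.history.pop()
            self.free.append(self.mapping[arch])
            self.mapping[arch] = old_loc


table = RenameTable(num_arch_regs=4, num_clusters=2, regs_per_cluster=4)
snapshot = dict(table.mapping)
table.rename_dest(2, cluster=1)   # architectural register 2 now lives in cluster 1
table.rollback()                  # an exception restores the earlier mapping
assert table.mapping == snapshot
```

Because software only ever names architectural registers, the cluster that holds a value stays invisible at the instruction set level in this model, which is the property the abstract attributes to renaming.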