A Simulation Study of Decoupled Vector Architectures

Authors: Roger Espasa; Mateo Valero
Affiliations: Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona (roger@ac.upc.es; mateo@ac.upc.es)
Venue: The Journal of Supercomputing
Year: 1999

Abstract

Decoupling techniques can be applied to a vector processor, yielding a large performance increase on vectorizable programs. We simulate a selection of programs from the Perfect Club and SPECfp92 benchmark suites and compare their execution times on a conventional single-port vector architecture with those on a decoupled vector architecture. For realistic memory latencies, decoupling improves performance by a factor greater than 1.4. Even with an ideal, zero-latency memory system, it still yields speedups of up to 1.3. A significant portion of this paper is devoted to the tradeoffs involved in choosing suitable sizes for the queues of the decoupled architecture. The queues need not be costly in hardware to capture most of the performance advantage of decoupling.
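The abstract does not describe the simulator itself. The sketch below is a deliberately simplified toy model, not the authors' simulator, that illustrates why a decoupled design with a load-data queue hides memory latency and why queue depth matters. It assumes (all hypothetical): every iteration performs one load followed by one compute step; a single memory port is busy for `issue` cycles per load; data arrives `latency` cycles after issue completes; the "coupled" machine fully serializes load, wait, and compute (more pessimistic than a real chained vector machine); and in the decoupled machine a queue slot is reserved when a load issues and released when the execute unit dequeues that data.

```python
def coupled_time(n, issue, latency, compute):
    """Toy coupled machine (assumption): each iteration issues its load,
    waits for the data, then computes, with no overlap between iterations."""
    return n * (issue + latency + compute)


def decoupled_time(n, issue, latency, compute, queue_depth):
    """Toy decoupled machine (assumption): the access side runs ahead of the
    execute side, limited by a load-data queue of `queue_depth` slots."""
    if queue_depth < 1:
        raise ValueError("queue_depth must be at least 1")
    port_free = 0      # cycle at which the single memory port is next free
    compute_free = 0   # cycle at which the execute unit is next free
    starts = []        # cycle at which each iteration dequeues its data and computes
    for i in range(n):
        # A slot is reserved at issue; the slot needed by iteration i is the one
        # released when iteration i - queue_depth dequeues its data.
        slot_free = starts[i - queue_depth] if i >= queue_depth else 0
        issue_start = max(port_free, slot_free)
        port_free = issue_start + issue
        data_ready = port_free + latency
        # The execute unit starts once it is idle and the data has arrived.
        start = max(compute_free, data_ready)
        starts.append(start)
        compute_free = start + compute
    return compute_free


if __name__ == "__main__":
    n, issue, latency, compute = 100, 4, 50, 8   # hypothetical parameters
    base = coupled_time(n, issue, latency, compute)
    for depth in (1, 2, 4, 8, 16):
        t = decoupled_time(n, issue, latency, compute, depth)
        print(f"queue depth {depth:2d}: {t:5d} cycles, speedup {base / t:.2f}")
```

In this toy model, a one-slot queue forces the access side to wait for every dequeue, so most of the latency stays exposed; once the queue holds enough entries to cover `latency` relative to the compute rate, execution time stops improving. This mirrors, only qualitatively, the abstract's point that modest queue sizes capture most of the benefit. The numbers produced here are not the paper's results.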