A Simulation Study of Decoupled Architecture Computers
IEEE Transactions on Computers
ASPLOS II Proceedings of the second international conference on Architectual support for programming languages and operating systems
Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Polycyclic Vector scheduling vs. Chaining on 1-Port Vector supercomputers
Proceedings of the 1988 ACM/IEEE conference on Supercomputing
Optimizing for parallelism and data locality
ICS '92 Proceedings of the 6th international conference on Supercomputing
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Designing the TFP Microprocessor
IEEE Micro
A performance study of software and hardware data prefetching schemes
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Out-of-order vector architectures
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
PIPE: a VLSI decoupled architecture
ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
Vector register design for polycyclic vector scheduling
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Decoupled access/execute computer architectures
ACM Transactions on Computer Systems (TOCS)
Cache performance in vector supercomputers
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
The MIPS R10000 Superscalar Microprocessor
IEEE Micro
Memory Latency Effects in Decoupled Architectures
IEEE Transactions on Computers
Performance Tradeoffs in Multithreaded Processors
IEEE Transactions on Parallel and Distributed Systems
Decoupled vector architectures
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Multithreaded Vector Architectures
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Quantitative analysis of vector code
PDP '95 Proceedings of the 3rd Euromicro Workshop on Parallel and Distributed Processing
Trident: a scalable architecture for scalar, vector, and matrix operations
CRPIT '02 Proceedings of the seventh Asia-Pacific conference on Computer systems architecture
Matrix bidiagonalization: implementation and evaluation on the Trident processor
Neural, Parallel & Scientific Computations
The Cray BlackWidow: a highly scalable vector multiprocessor
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Hi-index | 0.00 |
Decoupling techniques can be applied to a vector processor, resulting in a large increase in performance of vectorizable programs. We simulate a selection of the Perfect Club and Specfp92 benchmark suites and compare their execution time on a conventional single port vector architecture with that of a decoupled vector architecture. Decoupling increases the performance by a factor greater than 1.4 for realistic memory latencies, and for an ideal memory system with zero latency, there is still a speedup of as much as 1.3. A significant portion of this paper is devoted to studying the tradeoffs involved in choosing a suitable size for the queues of the decoupled architecture. The hardware cost of the queues need not be large to achieve most of the performance advantages of decoupling.