A Simulation Study of Decoupled Architecture Computers
IEEE Transactions on Computers
ASPLOS II Proceedings of the second international conference on Architectural support for programming languages and operating systems
Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language Design and Implementation
Polycyclic vector scheduling vs. chaining on 1-port vector supercomputers
Proceedings of the 1988 ACM/IEEE conference on Supercomputing
Optimizing for parallelism and data locality
ICS '92 Proceedings of the 6th international conference on Supercomputing
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Designing the TFP Microprocessor
IEEE Micro
A performance study of software and hardware data prefetching schemes
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Vector register design for polycyclic vector scheduling
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Decoupled access/execute computer architectures
ACM Transactions on Computer Systems (TOCS)
Cache performance in vector supercomputers
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Memory Latency Effects in Decoupled Architectures
IEEE Transactions on Computers
Performance Tradeoffs in Multithreaded Processors
IEEE Transactions on Parallel and Distributed Systems
Quantitative analysis of vector code
PDP '95 Proceedings of the 3rd Euromicro Workshop on Parallel and Distributed Processing
A victim cache for vector registers
ICS '97 Proceedings of the 11th international conference on Supercomputing
Out-of-order vector architectures
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
A performance study of out-of-order vector architectures and short registers
ICS '98 Proceedings of the 12th international conference on Supercomputing
Vector architectures: past, present and future
ICS '98 Proceedings of the 12th international conference on Supercomputing
A Simulation Study of Decoupled Vector Architectures
The Journal of Supercomputing
Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Overcoming the limitations of conventional vector processors
Proceedings of the 30th annual international symposium on Computer architecture
Cache Refill/Access Decoupling for Vector Machines
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Implementing virtual memory in a vector processor with software restart markers
Proceedings of the 20th annual international conference on Supercomputing
The potential energy efficiency of vector acceleration
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators
Proceedings of the 38th annual international symposium on Computer architecture
Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators
ACM Transactions on Computer Systems (TOCS)
A shared matrix unit for a chip multi-core processor
Journal of Parallel and Distributed Computing
This paper shows that decoupling techniques can greatly improve the performance of vector programs on a vector processor. Using a trace-driven approach, we simulate a selection of the Perfect Club programs and compare their execution times on a conventional vector architecture and on a decoupled vector architecture. Decoupling provides a performance advantage of more than a factor of two for realistic memory latencies, and even with an ideal zero-latency memory system there is still a speedup of as much as 50%. We also introduce a bypassing technique between the load/store queues and show that it can yield an extra speedup of up to 22% while reducing total memory traffic by an average of 20%. An important part of the paper is devoted to studying the tradeoffs involved in sizing the different queues of the architecture, so that their hardware cost can be minimized while still retaining most of the performance advantages of decoupling.
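The load/store queue bypassing idea mentioned in the abstract can be illustrated with a minimal toy model: a load whose address matches a store still pending in the store queue takes its value from the queue instead of going to memory, so that memory access never happens. This sketch is purely illustrative; the function name, the trace format, and the traffic accounting are assumptions, not taken from the paper's simulator.

```python
def memory_accesses(trace, bypass):
    """Count memory accesses for a trace of ('load'|'store', addr) pairs.

    With bypass=True, a load hitting an address that has a store still
    pending in the store queue is forwarded from the queue and issues
    no memory access. Stores are always counted, since they eventually
    drain to memory. This ignores queue capacity and timing entirely.
    """
    pending_stores = set()  # addresses held in the (unbounded) store queue
    accesses = 0
    for op, addr in trace:
        if op == 'store':
            pending_stores.add(addr)
            accesses += 1
        elif bypass and addr in pending_stores:
            pass  # load satisfied from the store queue: no memory traffic
        else:
            accesses += 1  # load must go to memory
    return accesses

# A tiny synthetic trace: two of the four loads re-read freshly stored data.
trace = [('store', 0x10), ('load', 0x10), ('load', 0x20),
         ('store', 0x20), ('load', 0x20), ('load', 0x30)]
```

On this trace, bypassing turns the two store-to-load re-reads into queue forwards, cutting total memory accesses from 6 to 4; the paper's reported average traffic reduction of 20% comes from this same effect on real program traces.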