Data prefetching in multiprocessor vector cache memories
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
An effective on-chip preloading scheme to reduce data access penalty
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Evaluating stream buffers as a secondary cache replacement
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Improving data cache performance by pre-executing instructions under a cache miss
ICS '97 Proceedings of the 11th international conference on Supercomputing
Data prefetching on the HP PA-8000
Proceedings of the 24th annual international symposium on Computer architecture
Out-of-order vector architectures
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Communications of the ACM - Special issue on computer architecture
Tarantula: a vector extension to the alpha architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
The Alpha 21264 Microprocessor
IEEE Micro
Imagine: Media Processing with Streams
IEEE Micro
Decoupled access/execute computer architectures
ISCA '82 Proceedings of the 9th annual symposium on Computer Architecture
Lockup-free instruction fetch/prefetch cache organization
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Decoupled vector architectures
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
So Many States, So Little Time: Verifying Memory Coherence in the Cray X1
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Overcoming the limitations of conventional vector processors
Proceedings of the 30th annual international symposium on Computer architecture
The Reconfigurable Streaming Vector Processor (RSVPTM)
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
The Vector-Thread Architecture
Proceedings of the 31st annual international symposium on Computer architecture
Stream Register Files with Indexed Access
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
The Vector-Thread Architecture
IEEE Micro
The potential energy efficiency of vector acceleration
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Compiling for vector-thread architectures
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
The Cray BlackWidow: a highly scalable vector multiprocessor
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Implementing the scale vector-thread processor
ACM Transactions on Design Automation of Electronic Systems (TODAES)
The Journal of Supercomputing
Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators
ACM Transactions on Computer Systems (TOCS)
Hi-index | 0.00 |
Vector processors often use a cache to exploit temporal locality and reduce memory bandwidth demands, but then require expensive logic to track large numbers of outstanding cache misses to sustain peak bandwidth from memory. We present refill/access decoupling, which augments the vector processor with a Vector Refill Unit (VRU) to quickly pre-execute vector memory commands and issue any needed cache line refills ahead of regular execution. The VRU reduces costs by eliminating much of the outstanding miss state required in traditional vector architectures and by using the cache itself as a cost-effective prefetch buffer. We also introduce vector segment accesses, a new class of vector memory instructions that efficiently encode two-dimensional access patterns. Segments reduce address bandwidth demands and enable more efficient refill/access decoupling by increasing the information contained in each vector memory command. Our results show that refill/access decoupling is able to achieve better performance with less resources than more traditional decoupling methods. Even with a small cache and memory latencies as long as 800 cycles, refill/access decoupling can sustain several kilobytes of in-flight data with minimal access management state and no need for expensive reserved element buffering.