Cache Refill/Access Decoupling for Vector Machines

Authors:
Christopher Batten; Ronny Krashinsky; Steve Gerding; Krste Asanovic
Affiliations:
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA (all authors)
Venue:
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2004


Abstract

Vector processors often use a cache to exploit temporal locality and reduce memory bandwidth demands, but then require expensive logic to track large numbers of outstanding cache misses to sustain peak bandwidth from memory. We present refill/access decoupling, which augments the vector processor with a Vector Refill Unit (VRU) to quickly pre-execute vector memory commands and issue any needed cache line refills ahead of regular execution. The VRU reduces costs by eliminating much of the outstanding miss state required in traditional vector architectures and by using the cache itself as a cost-effective prefetch buffer. We also introduce vector segment accesses, a new class of vector memory instructions that efficiently encode two-dimensional access patterns. Segments reduce address bandwidth demands and enable more efficient refill/access decoupling by increasing the information contained in each vector memory command. Our results show that refill/access decoupling achieves better performance with fewer resources than more traditional decoupling methods. Even with a small cache and memory latencies as long as 800 cycles, refill/access decoupling can sustain several kilobytes of in-flight data with minimal access management state and no need for expensive reserved element buffering.
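
The two-dimensional pattern that a vector segment access encodes can be illustrated with a small address-generation sketch. This is only an illustration under stated assumptions, not the paper's instruction encoding: the function names, the parameters (`base`, `stride`, `vlen`, `elem_bytes`, `seg_len`), and the RGB pixel example are all hypothetical. It compares one segment command against the equivalent set of strided commands and checks that both touch the same addresses.

```python
def strided_addresses(base, stride, vlen, elem_bytes, field):
    """Addresses touched by one strided vector load of a single field.

    Hypothetical model: element i of the vector reads the field at
    byte offset field * elem_bytes inside record i.
    """
    return [base + i * stride + field * elem_bytes for i in range(vlen)]


def segment_addresses(base, stride, vlen, elem_bytes, seg_len):
    """Addresses touched by one (hypothetical) vector segment load.

    Each of the vlen vector elements reads seg_len consecutive
    elements, so a single command describes a vlen x seg_len
    two-dimensional access pattern.
    """
    return [
        [base + i * stride + f * elem_bytes for f in range(seg_len)]
        for i in range(vlen)
    ]


# Assumed example: four packed RGB pixels, one byte per channel.
base, stride, vlen, elem_bytes, seg_len = 0x1000, 3, 4, 1, 3

# One segment command covers all three channels of every pixel.
seg = segment_addresses(base, stride, vlen, elem_bytes, seg_len)

# Without segments, one strided command is needed per channel.
strided = [
    strided_addresses(base, stride, vlen, elem_bytes, f)
    for f in range(seg_len)
]

seg_flat = sorted(a for row in seg for a in row)
strided_flat = sorted(a for col in strided for a in col)

# Both forms touch exactly the same bytes.
same_footprint = seg_flat == strided_flat
commands_with_segments = 1
commands_without_segments = len(strided)
```

In this sketch the segment form describes the same memory footprint with one command instead of `seg_len` commands, which is the sense in which each vector memory command carries more information.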