Hardware-only stream prefetching and dynamic access ordering

Authors:
Chengqiang Zhang;Sally A. McKee
Affiliations:
Department of Computer Science, University of Utah, Salt Lake City, UT;Department of Computer Science, University of Utah, Salt Lake City, UT
Venue:
Proceedings of the 14th international conference on Supercomputing
Year:
2000

Abstract

Memory system bottlenecks limit performance for many applications, and computations with strided access patterns are among the hardest hit. The streams used in such applications have extremely poor cache behavior. These access patterns are predictable, however, and that predictability can be exploited to improve the efficiency of the memory subsystem in two ways: memory latencies can be masked by prefetching stream data, and the latencies can be reduced by reordering stream accesses to exploit parallelism and locality within the DRAMs. Many researchers have studied hardware prefetching in its various forms. Others have examined dynamic memory scheduling to help bridge the performance gap between processors and DRAM systems. This study builds on these results, combining a stride-based reference prediction table, a mechanism that prefetches L2 cache lines, and a memory controller that dynamically schedules accesses to a Direct Rambus memory subsystem. We find that such a system delivers good speedups for scientific applications with regular access patterns without negatively affecting the performance of nonstreaming programs.
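
To make the idea of a stride-based reference prediction table concrete, the following is a minimal software sketch, not the hardware design evaluated in the paper. It assumes a table keyed by the program counter of each load, a single confidence bit per entry, and a prefetch distance of one stride; the names `RptEntry`, `ReferencePredictionTable`, `access`, and `demo` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RptEntry:
    """State kept per load instruction (illustrative fields only)."""
    last_addr: int
    stride: int = 0
    confident: bool = False  # set once the same nonzero stride repeats


class ReferencePredictionTable:
    """Toy stride predictor keyed by the PC of the load instruction."""

    def __init__(self) -> None:
        self.entries: Dict[int, RptEntry] = {}

    def access(self, pc: int, addr: int) -> Optional[int]:
        """Record one load; return an address to prefetch, or None."""
        entry = self.entries.get(pc)
        if entry is None:
            # First time this load is seen: remember its address only.
            self.entries[pc] = RptEntry(last_addr=addr)
            return None
        stride = addr - entry.last_addr
        # Predict only when the new stride matches the previous one.
        entry.confident = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr
        return addr + stride if entry.confident else None


def demo() -> List[Optional[int]]:
    rpt = ReferencePredictionTable()
    # One load at a made-up PC walking memory with a 64-byte stride.
    return [rpt.access(pc=0x400, addr=a) for a in (1000, 1064, 1128, 1192)]


if __name__ == "__main__":
    print(demo())  # [None, None, 1192, 1256]
```

In this sketch the first two accesses only train the entry; from the third access on, the table proposes the next address in the stream. A real prefetcher would also bound the table size, handle PC aliasing, and hand the predicted addresses to the memory controller, which is where access reordering could take place.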