A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Code generation for streaming: an access/execute mechanism
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Data prefetching in multiprocessor vector cache memories
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
An effective on-chip preloading scheme to reduce data access penalty
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Prefetch unit for vector operations on scalar computers
ACM SIGARCH Computer Architecture News
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Efficient detection of all pointer and array access errors
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Evaluating stream buffers as a secondary cache replacement
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Internal organization of the Alpha 21164, a 300-MHz 64-bit quad-issue CMOS RISC microprocessor
Digital Technical Journal - Special 10th anniversary issue
An effective programmable prefetch engine for on-chip caches
Proceedings of the 28th annual international symposium on Microarchitecture
PA-RISC 2.0 architecture
Synchronization and communication in the T3E multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Design and evaluation of dynamic access ordering hardware
ICS '96 Proceedings of the 10th international conference on Supercomputing
Memory-system design considerations for dynamically-scheduled processors
Proceedings of the 24th annual international symposium on Computer architecture
Active pages: a computation model for intelligent memory
Proceedings of the 25th annual international symposium on Computer architecture
A bandwidth-efficient architecture for media processing
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Sequential Hardware Prefetching in Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
Access ordering and memory-conscious cache utilization
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Access Order and Effective Bandwidth for Streams on a Direct Rambus Memory
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Impulse: Building a Smarter Memory Controller
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Command Vector Memory Systems: High Performance at Low Cost
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Designing a Modern Memory Hierarchy with Hardware Prefetching
IEEE Transactions on Computers
Effective stream-based and execution-based data prefetching
Proceedings of the 18th annual international conference on Supercomputing
Adaptive History-Based Memory Schedulers
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
On the importance of optimizing the configuration of stream prefetchers
Proceedings of the 2005 workshop on Memory system performance
Memory access pattern analysis and stream cache design for multimedia applications
ASP-DAC '03 Proceedings of the 2003 Asia and South Pacific Design Automation Conference
Efficient emulation of hardware prefetchers via event-driven helper threading
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Prefetch-Aware DRAM Controllers
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Application data prefetching on the IBM blue gene/Q supercomputer
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
RDIP: return-address-stack directed instruction prefetching
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
Memory system bottlenecks limit performance for many applications, and computations with strided access patterns are among the hardest hit. The streams used in such applications have extremely poor cache behavior. These access patterns have the advantage of being predictable, though, and this can be exploited to improve the efficiency of the memory subsystem in two ways: memory latencies can be masked by prefetching stream data, and the latencies can be reduced by reordering stream accesses to exploit parallelism and locality within the DRAMs. Many researchers have studied hardware prefetching in its various forms. Others have examined dynamic memory scheduling to help bridge the performance gap between processors and DRAM memory systems. This study builds on these results, combining a stride-based reference prediction table, a mechanism that prefetches L2 cache lines, and a memory controller that dynamically schedules accesses to a Direct Rambus memory subsystem. We find that such a system delivers good speedups for scientific applications with regular access patterns without negatively affecting the performance of nonstreaming programs.