Memory speeds have not kept up with processor speeds. More precisely, DRAM latency has not kept pace: processor speeds have been increasing by at least 70 percent per year, while DRAM latency has improved only 7 percent annually. As a result, a contemporary superscalar 300-MHz DEC Alpha system with a 40-ns DRAM can perform at least 24 instructions in the time it takes to access its memory just once. In a few years, if current trends continue, the number of instructions per access could increase to a thousand.

Fortunately, memory bandwidth is another matter. Wider buses, multiple banks, more pins, the integrated-circuit properties of DRAMs (such as static-column mode and on-chip cache), and the newer Rambus and synchronous DRAMs have all contributed to bandwidths that have scaled better than latency. A central problem for memory system designers is how to exploit this bandwidth to achieve lower latencies.

In this article, we describe a technique that can convert more than 90 percent of a memory system's bandwidth into low-latency accesses, at least for a particular class of computations. The scheme nicely complements traditional caching in two ways: it handles frequently occurring memory reference patterns for which caches do not perform well, and, by removing this problematic data from the cache, it reduces pollution, making the cache more effective for the remaining references.
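The "24 instructions per access" figure can be reproduced with simple arithmetic. A minimal sketch follows, using the 300-MHz clock and 40-ns DRAM latency from the text; the 2-wide superscalar issue width is an assumption chosen to match the stated claim, not a figure given in the article.

```python
# Back-of-the-envelope estimate of the processor/memory latency gap.
# Clock rate and DRAM latency are from the text; the issue width is
# an assumed value for a 2-wide superscalar Alpha.

clock_hz = 300e6          # 300-MHz DEC Alpha
dram_latency_s = 40e-9    # 40-ns DRAM access time
issue_width = 2           # assumed instructions issued per cycle

# Cycles the processor could have executed during one DRAM access.
stall_cycles = dram_latency_s * clock_hz          # 40 ns * 300 MHz = 12

# Instructions forgone per memory access on a 2-wide machine.
instructions_lost = stall_cycles * issue_width    # 12 * 2 = 24

print(round(stall_cycles), "cycles per access")
print(round(instructions_lost), "instructions per access")
```

At a 70-percent-per-year processor improvement against 7 percent for DRAM, the same calculation pushed a few years forward yields the "thousand instructions per access" projection.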