Reducing DRAM Latencies with an Integrated Memory Hierarchy Design

Authors:
Wei-fen Lin
Affiliations:
-
Venue:
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Year:
2001

Citing 0
Cited 54

Designing a Modern Memory Hierarchy with Hardware Prefetching
IEEE Transactions on Computers

Characterizing the d-TLB behavior of SPEC CPU2000 benchmarks
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems

Dynamic Cluster Resource Allocations for Jobs with Known and Unknown Memory Demands
IEEE Transactions on Parallel and Distributed Systems

Cached DRAM for ILP Processor Memory Access Latency Reduction
IEEE Micro

Using the Compiler to Improve Cache Replacement Decisions
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques

Efficient use of memory bandwidth to improve network processor throughput
Proceedings of the 30th annual international symposium on Computer architecture

Guided region prefetching: a cooperative hardware/software approach
Proceedings of the 30th annual international symposium on Computer architecture

Design and Optimization of Large Size and Low Overhead Off-Chip Caches
IEEE Transactions on Computers

Memory Controller Optimizations for Web Servers
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture

Eliminating Conflict Misses Using Prime Number-Based Cache Indexing
IEEE Transactions on Computers

A study of performance impact of memory controller features in multi-processor server environment
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture

Register saturation in instruction level parallelism
International Journal of Parallel Programming

Exploiting locality to ameliorate packet queue contention and serialization
Proceedings of the 3rd conference on Computing frontiers

The bit-reversal SDRAM address mapping
SCOPES '05 Proceedings of the 2005 workshop on Software and compilers for embedded systems

Overlapping dependent loads with addressless preload
Proceedings of the 15th international conference on Parallel architectures and compilation techniques

Victim management in a cache hierarchy
IBM Journal of Research and Development - Advanced silicon technology

Stealth prefetching
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems

Virtually Pipelined Network Memory
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture

Fair Queuing Memory Systems
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture

Memory Prefetching Using Adaptive Stream Detection
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture

Adaptive insertion policies for high performance caching
Proceedings of the 34th annual international symposium on Computer architecture

Near-Memory Caching for Improved Energy Consumption
IEEE Transactions on Computers

Data access history cache and associated data prefetching mechanisms
Proceedings of the 2007 ACM/IEEE conference on Supercomputing

Optimizing thread throughput for multithreaded workloads on memory constrained CMPs
Proceedings of the 5th conference on Computing frontiers

Designing packet buffers for router linecards
IEEE/ACM Transactions on Networking (TON)

Guided Prefetching Based on Runtime Access Patterns
ICCS '08 Proceedings of the 8th international conference on Computational Science, Part III

DRAM is plenty fast for wirespeed statistics counting
ACM SIGMETRICS Performance Evaluation Review

Prefetch-Aware DRAM Controllers
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture

Mini-rank: Adaptive DRAM architecture for improving memory power efficiency
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture

Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices
Proceedings of the 36th annual international symposium on Computer architecture

Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors
Proceedings of the 36th annual international symposium on Computer architecture

Comprehensive cache performance tuning with a toolset
Future Generation Computer Systems

High-bandwidth network memory system through virtual pipelines
IEEE/ACM Transactions on Networking (TON)

Multiprocessor System-on-Chip designs with active memory processors for higher memory efficiency
Proceedings of the 46th Annual Design Automation Conference

A concurrent dynamic analysis framework for multicore hardware
Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications

COMPASS: a programmable data prefetcher using idle GPU shaders
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems

Dynamic cluster resource allocations for jobs with known memory demands
Proceedings of the International Conference and Workshop on Emerging Trends in Technology

Early control of register pressure for software pipelined loops
CC'03 Proceedings of the 12th international conference on Compiler construction

Dynamic voltage and frequency scaling for scientific applications
LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing

The virtual write queue: coordinating DRAM and last-level cache policies
Proceedings of the 37th annual international symposium on Computer architecture

Design and analysis of a robust pipelined memory system
INFOCOM'10 Proceedings of the 29th conference on Information communications

Software-hardware cooperative DRAM bank partitioning for chip multiprocessors
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing

Design and performance analysis of a DRAM-based statistics counter array architecture
Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems

Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Exploring the prefetcher/memory controller design space: an opportunistic prefetch scheduling strategy
ARCS'11 Proceedings of the 24th international conference on Architecture of computing systems

Cache-conscious data placement in an in-memory key-value store
Proceedings of the 15th Symposium on International Database Engineering & Applications

Processor directed dynamic page policy
ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture

PACMan: prefetch-aware cache management for high performance caching
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

Reducing off-chip memory traffic by selective cache management scheme in GPGPUs
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units

SAD prefetching for MPEG4 using flux caches
SAMOS'06 Proceedings of the 6th international conference on Embedded Computer Systems: architectures, Modeling, and Simulation

DRAM-based statistics counter array architecture with performance guarantee
IEEE/ACM Transactions on Networking (TON)

Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation
Proceedings of the 40th Annual International Symposium on Computer Architecture

Return data interleaving for multi-channel embedded CMPs systems
IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Meeting midway: improving CMP performance with memory-side prefetching
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Abstract

In this paper, we address the severe performance gap caused by high processor clock rates and slow DRAM accesses. We show that even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half of its time stalling for L2 misses. Large cache blocks can improve performance, but only when coupled with wide memory channels. DRAM address mappings also affect performance significantly. We evaluate an aggressive prefetch unit integrated with the L2 cache and memory controllers. By issuing prefetches only when the Rambus channels are idle, prioritizing them to maximize DRAM row buffer hits, and giving them low replacement priority, we achieve a 43% speedup across 10 of the 26 SPEC2000 benchmarks, without degrading performance on the others. With eight Rambus channels, these ten benchmarks improve to within 10% of the performance of a perfect L2 cache.
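
The three prefetch policies named in the abstract (issue only when the channel is idle, prefer DRAM row buffer hits, insert with low replacement priority) can be illustrated with a minimal sketch. This is not the paper's implementation: the names Channel, Prefetch, pick_prefetch, and fill_set are hypothetical, and the single-channel model, the candidate list, and the LRU-ordered cache set are simplifying assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Channel:
    """Hypothetical model of one memory channel (an assumption, not from the paper)."""
    busy: bool = False
    # Maps bank number -> row currently held in that bank's row buffer.
    open_rows: Dict[int, int] = field(default_factory=dict)


@dataclass(frozen=True)
class Prefetch:
    """Hypothetical prefetch candidate addressed by DRAM bank and row."""
    bank: int
    row: int
    block: int


def pick_prefetch(channel: Channel,
                  demand_pending: bool,
                  candidates: List[Prefetch]) -> Optional[Prefetch]:
    """Choose at most one prefetch to issue, or None.

    Policy 1: never issue while the channel is busy or a demand miss waits.
    Policy 2: prefer a candidate whose row is already open (row buffer hit).
    """
    if channel.busy or demand_pending or not candidates:
        return None
    for p in candidates:
        if channel.open_rows.get(p.bank) == p.row:
            return p
    # No row buffer hit available; fall back to the oldest candidate.
    return candidates[0]


def fill_set(lru_order: List[str], block: str, is_prefetch: bool) -> None:
    """Fill one full cache set in place (index 0 = MRU, last = next victim).

    Policy 3: a prefetched block is inserted at the LRU end, so a useless
    prefetch is the first block evicted; a demand fill goes to the MRU end.
    """
    lru_order.pop()  # evict the current LRU block
    if is_prefetch:
        lru_order.append(block)
    else:
        lru_order.insert(0, block)


channel = Channel(open_rows={0: 7})
queue = [Prefetch(bank=1, row=3, block=100), Prefetch(bank=0, row=7, block=200)]
chosen = pick_prefetch(channel, demand_pending=False, candidates=queue)

ways = ["a", "b", "c", "d"]
fill_set(ways, "p", is_prefetch=True)
```

In this example the second candidate is chosen because bank 0 already has row 7 open, and the prefetched block "p" lands in the victim position of the set, so it displaces at most one block unless a later access promotes it.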