Micro-pages: increasing DRAM efficiency with locality-aware data placement

Authors:
Kshitij Sudan;Niladrish Chatterjee;David Nellans;Manu Awasthi;Rajeev Balasubramonian;Al Davis
Affiliations:
University of Utah, Salt Lake City, UT, USA;University of Utah, Salt Lake City, UT, USA;University of Utah, Salt Lake City, UT, USA;University of Utah, Salt Lake City, UT, USA;University of Utah, Salt Lake City, UT, USA;University of Utah, Salt Lake City, UT, USA
Venue:
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Year:
2010

Citing 46
Cited 30

Page placement policies for NUMA multiprocessors

Journal of Parallel and Distributed Computing
Exploiting operating system support for dynamic page placement on a NUMA shared memory multiprocessor

PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
Page placement algorithms for large real-indexed caches

ACM Transactions on Computer Systems (TOCS)
Scheduling and page migration for multiprocessor compute servers

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Avoiding conflict misses dynamically in large direct-mapped caches

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Surpassing the TLB performance of superpages with less operating system support

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Reducing TLB and memory overhead using online superpage promotion

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Operating system support for improving data locality on CC-NUMA compute servers

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
The art of computer programming, volume 1 (3rd ed.): fundamental algorithms

The art of computer programming, volume 1 (3rd ed.): fundamental algorithms
Increasing TLB reach using superpages backed by shadow memory

Proceedings of the 25th annual international symposium on Computer architecture
A performance comparison of contemporary DRAM architectures

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Reducing cache misses using hardware and software page placement

ICS '99 Proceedings of the 13th international conference on Supercomputing
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Online superpage promotion revisited (poster session)

Proceedings of the 2000 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Memory access scheduling

Proceedings of the 27th annual international symposium on Computer architecture
A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Concurrency, latency, or system overhead: which has the largest impact on uniprocessor DRAM-system performance?

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Memory controller policies for DRAM power management

ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
Improving Performance of Large Physically Indexed Caches by Decoupling Memory Addresses from Cache Addresses

IEEE Transactions on Computers
Designing a Modern Memory Hierarchy with Hardware Prefetching

IEEE Transactions on Computers
Symbiotic jobscheduling with priorities for a simultaneous multithreading processor

SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Simics: A Full System Simulation Platform

Computer
Impulse: Building a Smarter Memory Controller

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Experimental Comparison of Memory Management Policies for NUMA Multiprocessors

Experimental Comparison of Memory Management Policies for NUMA Multiprocessors
DRAM Energy Management Using Sof ware and Hardware Directed Power Mode Control

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Fine-grain Priority Scheduling on Multi-channel Memory Systems

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
A Performance Comparison of DRAM Memory System Optimizations for SMT Processors

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Improving energy efficiency by making DRAM less randomly accessed

ISLPED '05 Proceedings of the 2005 international symposium on Low power electronics and design
DRAMsim: a memory system simulator

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Page migration with dynamic space-sharing scheduling policies: the case of the SGI 02000

International Journal of Parallel Programming - Special issue II: The 17th annual international conference on supercomputing (ICS'03)
Architectural support for operating system-driven CMP cache management

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
SPEC CPU2006 benchmark descriptions

ACM SIGARCH Computer Architecture News
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Design and implementation of power-aware virtual memory

ATEC '03 Proceedings of the annual conference on USENIX Annual Technical Conference
BioBench: A Benchmark Suite of Bioinformatics Applications

ISPASS '05 Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Mini-rank: Adaptive DRAM architecture for improving memory power efficiency

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Memory Systems: Cache, DRAM, Disk

Memory Systems: Cache, DRAM, Disk
Reactive NUCA: near-optimal block placement and replication in distributed caches

Proceedings of the 36th annual international symposium on Computer architecture
Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices

Proceedings of the 36th annual international symposium on Computer architecture
Disaggregated memory for expansion and sharing in blade servers

Proceedings of the 36th annual international symposium on Computer architecture
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
Hardware execution throttling for multi-core resource management

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference

Rethinking DRAM design and organization for energy-constrained multi-cores

Proceedings of the 37th annual international symposium on Computer architecture
MemScale: active low-power modes for main memory

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Page placement in hybrid memory systems

Proceedings of the international conference on Supercomputing
Memory power management via dynamic voltage/frequency scaling

Proceedings of the 8th ACM international conference on Autonomic computing
Dymaxion: optimizing memory access patterns for heterogeneous systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Improving System Energy Efficiency with Memory Rank Subsetting

ACM Transactions on Architecture and Code Optimization (TACO)
Reducing memory interference in multicore systems via application-aware memory channel partitioning

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Rank idle time prediction driven last-level cache writeback

Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Multiple sub-row buffers in DRAM: unlocking performance and energy improvement opportunities

Proceedings of the 26th ACM international conference on Supercomputing
Unified memory optimizing architecture: memory subsystem control with a unified predictor

Proceedings of the 26th ACM international conference on Supercomputing
PARDIS: a programmable memory controller for the DDRx interfacing standards

Proceedings of the 39th Annual International Symposium on Computer Architecture
Improving writeback efficiency with decoupled last-write prediction

Proceedings of the 39th Annual International Symposium on Computer Architecture
A case for exploiting subarray-level parallelism (SALP) in DRAM

Proceedings of the 39th Annual International Symposium on Computer Architecture
A software memory partition approach for eliminating bank-level interference in multicore systems

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
RAMZzz: rank-aware dram power management with dynamic migrations and demotions

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A survey of architectural techniques for DRAM power management

International Journal of High Performance Systems Architecture
Regularities considered harmful: forcing randomness to memory accesses to reduce row buffer conflicts for multi-core, multi-bank systems

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
CoScale: Coordinating CPU and Memory System DVFS in Server Systems

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Proactive aging management in heterogeneous NoCs through a criticality-driven routing approach

Proceedings of the Conference on Design, Automation and Test in Europe
Reducing memory access latency with asymmetric DRAM bank organizations

Proceedings of the 40th Annual International Symposium on Computer Architecture
Effect of page frame allocation pattern on bank conflicts in multi-core systems

Proceedings of the 2013 Research in Adaptive and Convergent Systems
Exploring hybrid memory for GPU energy efficiency through software-hardware co-design

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Reshaping cache misses to improve row-buffer locality in multicore systems

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Meeting midway: improving CMP performance with memory-side prefetching

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
A programmable memory controller for the DDRx interfacing standards

ACM Transactions on Computer Systems (TOCS)
RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Reducing DRAM row activations with eager read/write clustering

ACM Transactions on Architecture and Code Optimization (TACO)
WADE: Writeback-aware dynamic cache management for NVM-based main memory system

ACM Transactions on Architecture and Code Optimization (TACO)
BPM/BPM+: Software-based dynamic memory partitioning mechanisms for mitigating DRAM bank-/channel-level interferences in multicore systems

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems read data from cell arrays and populate a row buffer as large as 8 KB on a memory request. But only a small fraction of these bits are ever returned back to the CPU. This ends up wasting energy and time to read (and subsequently write back) bits which are used rarely. Traditionally, an open-page policy has been used for uni-processor systems and it has worked well because of spatial and temporal locality in the access stream. In future multi-core processors, the possibly independent access streams of each core are interleaved, thus destroying the available locality and significantly under-utilizing the contents of the row buffer. In this work, we attempt to improve row-buffer utilization for future multi-core systems. The schemes presented here are motivated by our observations that a large number of accesses within heavily accessed OS pages are to small, contiguous "chunks" of cache blocks. Thus, the co-location of chunks (from different OS pages) in a row-buffer will improve the overall utilization of the row buffer contents, and consequently reduce memory energy consumption and access time. Such co-location can be achieved in many ways, notably involving a reduction in OS page size and software or hardware assisted migration of data within DRAM. We explore these mechanisms and discuss the trade-offs involved along with energy and performance improvements from each scheme. On average, for applications with room for improvement, our best performing scheme increases performance by 9% (max. 18%) and reduces memory energy consumption by 15% (max. 70%).