RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization

Authors:
Vivek Seshadri;Yoongu Kim;Chris Fallin;Donghyuk Lee;Rachata Ausavarungnirun;Gennady Pekhimenko;Yixin Luo;Onur Mutlu;Phillip B. Gibbons;Michael A. Kozuch;Todd C. Mowry
Affiliations:
Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Intel Pittsburgh;Intel Pittsburgh;Carnegie Mellon University
Venue:
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2013

Abstract

Several system-level operations trigger bulk data copy or initialization. Even though these bulk data operations do not require any computation, current systems transfer a large quantity of data back and forth on the memory channel to perform them. As a result, bulk data operations incur high latency and consume substantial bandwidth and energy, degrading both system performance and energy efficiency. In this work, we propose RowClone, a new and simple mechanism to perform bulk copy and initialization completely within DRAM, eliminating the need to transfer any data over the memory channel for such operations. Our key observation is that DRAM can internally and efficiently transfer a large quantity of data (multiple KBs) between a row of DRAM cells and the associated row buffer. Based on this, our primary mechanism can quickly copy an entire row of data from a source row to a destination row by first copying the data from the source row to the row buffer and then from the row buffer to the destination row, via two back-to-back activate commands. This mechanism, which we call the Fast Parallel Mode of RowClone, reduces the latency and energy consumption of a 4KB bulk copy operation by 11.6x and 74.4x, respectively, and of a 4KB bulk zeroing operation by 6.0x and 41.5x, respectively. To efficiently copy data between rows that do not share a row buffer, we propose a second mode of RowClone, the Pipelined Serial Mode, which uses the shared internal bus of a DRAM chip to quickly copy data between two banks. RowClone requires only a 0.01% increase in DRAM chip area. We quantitatively evaluate the benefits of RowClone by focusing on fork, a frequently invoked system call, and five other copy- and initialization-intensive applications. Our results show that RowClone can significantly improve both single-core and multi-core system performance, while also significantly reducing main memory bandwidth and energy consumption.
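
The Fast Parallel Mode described above can be illustrated with a small functional sketch. This is a toy model under stated assumptions, not the authors' implementation: the class `ToySubarray`, its methods, and the `channel_bytes` counter are hypothetical names introduced here. It captures only the data movement (source row to row buffer, then row buffer to destination row) and ignores timing, energy, and circuit behavior.

```python
class ToySubarray:
    """Hypothetical toy model: DRAM rows that share one row buffer.

    Functional only; it does not model timing, energy, or sense amplifiers.
    """

    def __init__(self, num_rows, row_bytes):
        self.rows = [bytearray(row_bytes) for _ in range(num_rows)]
        self.row_buffer = None   # None models the precharged (empty) state
        self.channel_bytes = 0   # bytes that crossed the memory channel

    def precharge(self):
        """Close the open row and clear the row buffer."""
        self.row_buffer = None

    def activate(self, row):
        if self.row_buffer is None:
            # First activate after a precharge: the row's contents
            # are latched into the row buffer.
            self.row_buffer = bytearray(self.rows[row])
        else:
            # Back-to-back activate with no precharge in between:
            # the row buffer's contents are written into the new row.
            self.rows[row][:] = self.row_buffer

    def fpm_copy(self, src, dst):
        """Copy a whole row inside the subarray via two activates."""
        self.precharge()
        self.activate(src)   # source row -> row buffer
        self.activate(dst)   # row buffer -> destination row
        self.precharge()

    def channel_copy(self, src, dst):
        """Baseline for comparison: read the row out, then write it back."""
        data = bytes(self.rows[src])
        self.channel_bytes += len(data)   # read over the channel
        self.rows[dst][:] = data
        self.channel_bytes += len(data)   # write over the channel


# Usage: copy one 4 KB row both ways and compare channel traffic.
sub = ToySubarray(num_rows=4, row_bytes=4096)
sub.rows[0][:] = bytes([0xAB]) * 4096

sub.fpm_copy(src=0, dst=2)
fpm_traffic = sub.channel_bytes

sub.channel_copy(src=0, dst=3)
baseline_traffic = sub.channel_bytes - fpm_traffic
```

In this sketch the in-DRAM copy moves no bytes over the modeled channel, while the baseline moves the row twice (one read, one write). The model says nothing about the latency and energy factors reported in the abstract; those come from the authors' evaluation.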