The impact of architectural trends on operating system performance
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Automatic I/O hint generation through speculative execution
OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
IEEE Micro
Memory resource management in VMware ESX server
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Mac OS X Internals
Understanding The Linux Kernel
Understanding The Linux Kernel
Hardware Support for Bulk Data Movement in Server Platforms
ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Flashback: a lightweight extension for rollback and deterministic replay for software debugging
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Shredding your garbage: reducing data lifetime through secure deallocation
SSYM'05 Proceedings of the 14th conference on USENIX Security Symposium - Volume 14
Protecting Cryptographic Keys from Memory Disclosure Attacks
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
EXECUBE-A New Architecture for Scaleable MPPs
ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Lest we remember: cold boot attacks on encryption keys
SS'08 Proceedings of the 17th conference on Security symposium
SnowFlock: rapid virtual machine cloning for cloud computing
Proceedings of the 4th ACM European conference on Computer systems
Mini-rank: Adaptive DRAM architecture for improving memory power efficiency
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Online design bug detection: RTL analysis, flexible mechanisms, and evaluation
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Architecture Support for Improving Bulk Memory Copying and Initialization Performance
PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
PLFS: a checkpoint filesystem for parallel applications
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Application-aware prioritization mechanisms for on-chip networks
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Micro-pages: increasing DRAM efficiency with locality-aware data placement
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Rethinking DRAM design and organization for energy-constrained multi-cores
Proceedings of the 37th annual international symposium on Computer architecture
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Operating system support for application-specific speculation
Proceedings of the sixth conference on Computer systems
Nanoscale Memory Repair
Automatic CPU-GPU communication management and optimization
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Why nothing matters: the impact of zeroing
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
Reducing memory interference in multicore systems via application-aware memory channel partitioning
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
A case for exploiting subarray-level parallelism (SALP) in DRAM
Proceedings of the 39th Annual International Symposium on Computer Architecture
Eternal sunshine of the spotless machine: protecting privacy with ephemeral channels
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Tiered-latency DRAM: A low latency and low cost DRAM architecture
HPCA '13 Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)
MISE: Providing performance predictability and improving fairness in shared main memory systems
HPCA '13 Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)
SI-TM: reducing transactional memory abort rates through snapshot isolation
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
Several system-level operations trigger bulk data copy or initialization. Even though these bulk data operations do not require any computation, current systems transfer a large quantity of data back and forth on the memory channel to perform such operations. As a result, bulk data operations consume high latency, bandwidth, and energy--degrading both system performance and energy efficiency. In this work, we propose RowClone, a new and simple mechanism to perform bulk copy and initialization completely within DRAM -- eliminating the need to transfer any data over the memory channel to perform such operations. Our key observation is that DRAM can internally and efficiently transfer a large quantity of data (multiple KBs) between a row of DRAM cells and the associated row buffer. Based on this, our primary mechanism can quickly copy an entire row of data from a source row to a destination row by first copying the data from the source row to the row buffer and then from the row buffer to the destination row, via two back-to-back activate commands. This mechanism, which we call the Fast Parallel Mode of RowClone, reduces the latency and energy consumption of a 4KB bulk copy operation by 11.6x and 74.4x, respectively, and a 4KB bulk zeroing operation by 6.0x and 41.5x, respectively. To efficiently copy data between rows that do not share a row buffer, we propose a second mode of RowClone, the Pipelined Serial Mode, which uses the shared internal bus of a DRAM chip to quickly copy data between two banks. RowClone requires only a 0.01% increase in DRAM chip area. We quantitatively evaluate the benefits of RowClone by focusing on fork, one of the frequently invoked system calls, and five other copy and initialization intensive applications. Our results show that RowClone can significantly improve both single-core and multi-core system performance, while also significantly reducing main memory bandwidth and energy consumption.