Active memory operations

  • Authors: John B. Carter; Zhen Fang

  • Affiliations: The University of Utah; The University of Utah

  • Venue: PhD dissertation, The University of Utah
  • Year: 2006

Abstract

The performance of memory-intensive applications is often limited by how fast the memory system can deliver needed data. For local memory, the speed gap between the CPU and DRAMs leads to significant stalls when applications' memory references lack enough locality for caches to be effective. For remote memory, growing network latency, measured in processor clock periods, makes internode communication inordinately expensive. Caching is the standard mitigation, but for applications with poor memory locality, caches do not improve performance. In addition, in a cache-coherent, non-uniform memory access (cc-NUMA) system, the multiple non-overlapped network latencies dictated by a write-invalidate coherence protocol often exacerbate the memory latency problem, and bisection bandwidth in large-scale DSM systems further limits data-intensive parallel applications. As a result, reducing local memory latency, remote coherence traffic, and the number of internode data transfers is essential for multiprocessor systems to scale effectively.

In general, moving data through the memory system and memory hierarchy into caches, only to evict it from the processor core, is inefficient if the data are not reused sufficiently. To attack this problem, we propose Active Memory Operations (AMOs), in which select operations can be sent to and executed on the data's home memory controller. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. We present an implementation of AMOs that is cache-coherent and requires no changes to the processor core or DRAM chips.

In this dissertation, we present architectural and programming models for AMOs and compare their performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations: up to 50X faster barriers, 12X faster spinlocks, 8.5X-15X faster stream/array operations, and 3X faster sequential-scan database queries. We further show that this performance can be achieved with little chip overhead: based on a standard cell implementation, the circuitry required to support AMOs is predicted to occupy less than 1% of the typical die area of a high-performance microprocessor. For AMO-optimized applications, AMOs are also more energy-efficient than current mainstream microprocessors and offer substantial power-saving opportunities.
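
To make the AMO programming model concrete, the sketch below shows how an AMO-style fetch-and-add might be used to build a sense-reversing barrier. This is a hypothetical illustration, not the dissertation's actual interface: the GCC builtin __atomic_fetch_add stands in for an AMO that, on AMO hardware, would be shipped to and executed at the barrier variable's home memory controller, so N arriving threads exchange N small request/response messages instead of ping-ponging the cache line under a write-invalidate protocol.

#include <stdint.h>

typedef struct {
    uint64_t count;     /* arrivals in the current barrier episode */
    uint64_t sense;     /* flips each time the barrier opens       */
    uint64_t nthreads;  /* number of participating threads         */
} amo_barrier_t;

void amo_barrier_wait(amo_barrier_t *b)
{
    /* Each arrival waits for the sense observed on entry to flip. */
    uint64_t my_sense = !__atomic_load_n(&b->sense, __ATOMIC_ACQUIRE);

    /* Conceptually a single AMO: one round trip to the home memory
     * controller, which performs the add in place without pulling
     * the cache line into this CPU's cache.                        */
    uint64_t arrived = __atomic_fetch_add(&b->count, 1, __ATOMIC_ACQ_REL);

    if (arrived + 1 == b->nthreads) {
        /* Last arriver: reset the counter, then release everyone. */
        __atomic_store_n(&b->count, 0, __ATOMIC_RELAXED);
        __atomic_store_n(&b->sense, my_sense, __ATOMIC_RELEASE);
    } else {
        /* Spin until the barrier opens; AMO hardware could instead
         * defer the reply until the last thread arrives.           */
        while (__atomic_load_n(&b->sense, __ATOMIC_ACQUIRE) != my_sense)
            ;
    }
}

On a conventional cc-NUMA machine, each fetch-and-add bounces the barrier's cache line among the arriving processors; executing the add at the home controller removes exactly that coherence traffic, which is the kind of reduction behind the barrier and spinlock speedups reported above.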