Active memory controller

Authors:
Zhen Fang;Lixin Zhang;John B. Carter;Sally A. McKee;Ali Ibrahim;Michael A. Parker;Xiaowei Jiang
Affiliations:
nVidia Corporation, Santa Clara, USA;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;IBM Austin Research Lab, Austin, USA;Chalmers University of Technology, Gothenburg, Sweden;AMD, Sunnyvale, USA;nVidia Corporation, Santa Clara, USA;Intel Labs, Intel Corporation, Santa Clara, USA
Venue:
The Journal of Supercomputing
Year:
2012

Abstract

Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips.

In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50× faster barriers, 12× faster spinlocks, 8.5×–15× faster stream/array operations, and 3× faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.
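
The idea of sending an operation to the home memory controller, rather than moving the data to the processor, can be illustrated with a toy Python model. This is a minimal sketch under stated assumptions, not the AMC design from the paper: the class `HomeMemoryController`, the method `amo_add`, and the fixed cost of two messages per round trip are all hypothetical, and the model ignores caching, atomicity, and the coherence protocol entirely.

```python
from dataclasses import dataclass, field


@dataclass
class HomeMemoryController:
    """Toy stand-in for a home memory controller (hypothetical names throughout)."""
    memory: dict = field(default_factory=dict)
    messages: int = 0  # messages counted by this toy model, not a real protocol

    def read(self, addr):
        # Processor-side path: request + data reply (assumed cost: 2 messages).
        self.messages += 2
        return self.memory.get(addr, 0)

    def write(self, addr, value):
        # Processor-side path: write + acknowledgement (assumed cost: 2 messages).
        self.messages += 2
        self.memory[addr] = value

    def amo_add(self, addr, operand):
        # AMO-style path: the operation travels to the data and runs at the
        # controller; request + reply (assumed cost: 2 messages).
        self.messages += 2
        self.memory[addr] = self.memory.get(addr, 0) + operand
        return self.memory[addr]


def processor_side_increments(n):
    """Increment a counter n times by fetching it, adding 1, and writing it back."""
    mc = HomeMemoryController()
    for _ in range(n):
        mc.write("counter", mc.read("counter") + 1)
    return mc.memory["counter"], mc.messages


def memory_side_increments(n):
    """Increment a counter n times by sending the add to the controller."""
    mc = HomeMemoryController()
    for _ in range(n):
        mc.amo_add("counter", 1)
    return mc.memory["counter"], mc.messages


if __name__ == "__main__":
    print(processor_side_increments(8))
    print(memory_side_increments(8))
```

Both functions leave the counter at the same value; under the assumed message costs the memory-side version needs half as many messages. The speedups reported in the abstract come from the authors' simulations and cannot be derived from this sketch.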