Direct distributed memory access for CMPs

Authors:
Weiwei Fu;Li Liu;Tianzhou Chen
Affiliations:
-;-;-
Venue:
Journal of Parallel and Distributed Computing
Year:
2014

Abstract

On-chip distributed memory has emerged as a promising memory organization for future many-core systems: it efficiently exploits memory-level parallelism and lightens the load on each memory module by providing a number of memory interfaces comparable to the number of on-chip cores. The packet-based memory access model (PDMA) provides a scalable and flexible solution for distributed memory management, but it suffers from complicated and costly on-chip network protocol translation and from heavy interference among packets, which lead to unpredictable performance. In this paper we propose a direct distributed memory access (DDMA) model, in which local cores access remote memory directly via remote-to-local virtualization, without network protocol translation. From the perspective of a local core, remote memory controllers (MCs) are manipulated by accessing the local agent MC, which is responsible for accessing remote memory through high-performance inter-tile communication. We further discuss the architectural support for the DDMA model, including the memory interface design, the workflow, and the protocols involved. Simulation results for the PARSEC benchmarks show that, on average, our DDMA architecture improves on PDMA by 17.8% in average memory access latency and by 16.6% in IPC. DDMA also manages congested memory traffic better: when bandwidth is reduced while running memory-intensive SPEC2006 workloads, DDMA incurs only an 18.9% performance penalty, compared with 38.3% for PDMA.
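
The idea of reaching remote memory through a local agent MC can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the paper's hardware design: the class names `MemoryController` and `AgentMC`, the four-tile layout, the 1024-byte region per tile, and the contiguous address-to-tile mapping are all hypothetical choices made for illustration. A direct method call stands in for the inter-tile communication, so no timing or protocol behaviour is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryController:
    """One per-tile memory controller (MC) with its own backing store."""
    tile_id: int
    store: dict = field(default_factory=dict)

    def read(self, offset):
        # Unwritten locations read as 0 in this toy model.
        return self.store.get(offset, 0)

    def write(self, offset, value):
        self.store[offset] = value


class AgentMC:
    """Hypothetical local agent MC: the only memory interface a core uses."""

    def __init__(self, tile_id, controllers, region_size):
        self.tile_id = tile_id
        self.controllers = controllers  # all MCs on the chip, indexed by tile
        self.region_size = region_size  # assumed bytes of memory per tile
        self.remote_accesses = 0        # count of accesses forwarded off-tile

    def _locate(self, addr):
        # Assumed mapping: one contiguous address region per tile.
        if addr < 0:
            raise ValueError("address must be non-negative")
        tile, offset = divmod(addr, self.region_size)
        if tile >= len(self.controllers):
            raise ValueError("address outside the modelled memory")
        return tile, offset

    def _target(self, tile):
        # A plain method call stands in for inter-tile communication.
        if tile != self.tile_id:
            self.remote_accesses += 1
        return self.controllers[tile]

    def read(self, addr):
        tile, offset = self._locate(addr)
        return self._target(tile).read(offset)

    def write(self, addr, value):
        tile, offset = self._locate(addr)
        self._target(tile).write(offset, value)


# Usage: a core on tile 0 issues ordinary reads and writes to its agent MC.
controllers = [MemoryController(t) for t in range(4)]
agent0 = AgentMC(tile_id=0, controllers=controllers, region_size=1024)

agent0.write(16, 7)                       # lands in tile 0's own memory
agent0.write(2 * 1024 + 8, 42)            # forwarded to tile 2's MC
remote_value = agent0.read(2 * 1024 + 8)  # read back through the agent
```

The core-side interface is the same for local and remote addresses; only the agent knows which MC actually serves the request, which is the property the remote-to-local virtualization described above is meant to provide.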