The design space of data-parallel memory systems

Authors:
Jung Ho Ahn; Mattan Erez; William J. Dally
Affiliations:
Stanford University, Stanford, California; Stanford University, Stanford, California; Stanford University, Stanford, California
Venue:
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Year:
2006

Abstract

Data-parallel memory systems must maintain a large number of outstanding memory references to fully use increasing DRAM bandwidth in the presence of rising latencies. Additionally, throughput is increasingly sensitive to the reference patterns due to the rising latency of issuing DRAM commands, switching between reads and writes, and precharging/activating internal DRAM banks. We study the design space of data-parallel memory systems in light of these trends of increasing concurrency, latency, and sensitivity to access patterns. We perform a detailed performance analysis of scientific and multimedia applications and micro-benchmarks, varying DRAM parameters and the memory-system configuration. We identify the interference between concurrent read and write memory-access threads, and bank conflicts, both within a single thread and across multiple threads, as the most critical factors affecting performance. We then develop hardware techniques to minimize throughput degradation. We advocate either relying on multiple concurrent accesses from a single memory-reference thread only, while sacrificing load-balance, or introducing new hardware to maintain both locality of reference and load-balance between multiple DRAM channels with multiple threads. We show that a low-cost configuration with only 16 channel-buffer entries achieves over 80% of peak throughput in most cases.
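
Two of the effects named in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper. It assumes Little's law as the model for how many references must be in flight to cover latency, and a simple word-interleaved bank mapping (bank = address mod number of banks) to show how a strided stream can pile onto a few banks. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def outstanding_refs_needed(bandwidth_bytes_per_ns, latency_ns, bytes_per_ref):
    """Little's law: references in flight = throughput * latency.

    Throughput in references per ns is bandwidth / bytes_per_ref.
    All inputs are illustrative assumptions, not measured DRAM parameters.
    """
    return math.ceil(bandwidth_bytes_per_ns * latency_ns / bytes_per_ref)


def banks_touched(stride_words, num_banks, num_refs):
    """Count distinct banks hit by a strided stream.

    Assumes word-interleaved mapping: bank = word_address % num_banks.
    Fewer distinct banks means more references queue behind the same bank.
    """
    return len({(i * stride_words) % num_banks for i in range(num_refs)})


if __name__ == "__main__":
    # Hypothetical system: 16 bytes/ns of bandwidth, 100 ns latency, 32-byte references.
    print(outstanding_refs_needed(16.0, 100.0, 32))  # 50

    # Hypothetical 8-bank memory, 64 references per stream.
    for stride in (1, 2, 3, 8):
        print(stride, banks_touched(stride, num_banks=8, num_refs=64))
```

Under these assumptions, doubling either bandwidth or latency doubles the number of references that must be outstanding. A stride equal to the bank count maps every reference to one bank, while unit and odd strides spread across all eight.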