SAMS: single-affiliation multiple-stride parallel memory scheme

Authors:
Chunyang Gou;Georgi K. Kuzmanov;Georgi N. Gaydadjiev
Affiliations:
Faculty of Electrical Engineering, Mathematics and Computer Science, TU Delft, Delft, Netherlands;Faculty of Electrical Engineering, Mathematics and Computer Science, TU Delft, Delft, Netherlands;Faculty of Electrical Engineering, Mathematics and Computer Science, TU Delft, Delft, Netherlands
Venue:
Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Year:
2008

Abstract

In this paper, we analyze the problem of supporting conflict-free access for multiple stride families in parallel memory schemes targeted at SIMD processing systems. We propose a Single-Affiliation Multiple-Stride (SAMS) scheme that supports both unit-stride and strided conflict-free vector memory accesses. We compare our scheme against previously proposed techniques that use buffers and inter-vector out-of-order access. The main advantage of our proposal is that atomic parallel access is supported without limiting the vector lengths, which provides better support for short vectors. Our scheme also utilizes memory module resources better than solutions that rely on additional modules. Synthesis results for a Virtex2-Pro FPGA reconfigurable platform indicate that the address translation of the SAMS scheme can be implemented efficiently in hardware, with a logic delay of less than 3 ns and trivial hardware resource utilization.
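
To illustrate the conflict problem the abstract refers to, the following minimal Python sketch counts how many memory modules a single parallel access touches under plain low-order interleaving. This is not the SAMS address translation, which the abstract does not specify; the function names, the choice of 8 modules, and the modulo mapping are assumptions made purely for illustration.

```python
from math import gcd


def module_of(address, num_modules):
    """Assumed baseline mapping: low-order interleaving (address mod module count)."""
    return address % num_modules


def distinct_modules(base, stride, num_modules):
    """Count the modules touched when num_modules elements are fetched in parallel."""
    touched = {module_of(base + i * stride, num_modules) for i in range(num_modules)}
    return len(touched)


def is_conflict_free(base, stride, num_modules):
    """An access is conflict-free when every element lands in a different module."""
    return distinct_modules(base, stride, num_modules) == num_modules


if __name__ == "__main__":
    M = 8  # hypothetical number of memory modules
    for stride in (1, 2, 3, 4, 8):
        n = distinct_modules(0, stride, M)
        # Under this mapping the count equals M / gcd(stride, M).
        print(f"stride={stride}: {n} distinct modules, "
              f"conflict-free={is_conflict_free(0, stride, M)}, "
              f"M/gcd={M // gcd(stride, M)}")
```

Under this baseline mapping only strides that are coprime with the module count are conflict-free, so unit-stride and odd-stride accesses proceed at full bandwidth while even strides serialize on a subset of the modules. Schemes such as the one proposed in the paper aim to widen the set of strides that can be served without such conflicts.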