An approach is considered in which conflict-free access at any constant stride is achieved by selecting a storage scheme for each vector according to the access patterns used with that vector. The stride is factored into two components, a power of 2 and an odd factor (that is, one relatively prime to 2), and from these a storage scheme is synthesized that allows conflict-free access to the vector at the specified stride. All such schemes are based on a variation of the row rotation mechanism proposed by P. Budnik and D. Kuck. Each storage scheme is defined by two parameters: one describes the type of rotation to perform, and the other describes the amount of memory to be rotated as a single block. The performance of the memory under access strides other than the one used to specify the storage scheme is also considered. These other strides model a vector that is accessed with multiple strides, as well as cases in which the stride cannot be determined before the vector is initialized. Simulation results show that if a single buffer is added to each memory port, the average performance of the dynamic scheme surpasses that of the interleaved scheme for arbitrary stride accesses.
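The stride factoring described in the abstract can be illustrated with a small sketch. It is not the row-rotation scheme of the paper. It assumes a power-of-two module count and uses a simplified, hypothetical mapping (`stride_matched_module`) that discards the low-order bits corresponding to the power-of-2 part of the stride before selecting a module. All function names are chosen here for illustration only.

```python
def factor_stride(stride):
    """Split a positive stride into (sigma, s) with stride == sigma * 2**s, sigma odd."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    s = 0
    while stride % 2 == 0:
        stride //= 2
        s += 1
    return stride, s


def interleaved_module(addr, num_modules):
    # Conventional low-order interleaving: module index is the address modulo M.
    return addr % num_modules


def stride_matched_module(addr, num_modules, s):
    # Hypothetical stand-in for a stride-specific storage scheme (an assumption,
    # not the paper's mapping): drop the s low-order address bits, then
    # interleave. Assumes num_modules is a power of two.
    return (addr >> s) % num_modules


def count_conflicts(modules):
    # Number of accesses that land on a module already used in this group.
    return len(modules) - len(set(modules))


def demo(base, stride, num_modules=8):
    """Compare both mappings on num_modules consecutive elements of a strided vector."""
    sigma, s = factor_stride(stride)
    addrs = [base + i * stride for i in range(num_modules)]
    plain = count_conflicts([interleaved_module(a, num_modules) for a in addrs])
    matched = count_conflicts([stride_matched_module(a, num_modules, s) for a in addrs])
    return sigma, s, plain, matched
```

For example, `demo(5, 12)` returns `(3, 2, 6, 0)`: stride 12 factors as 3 · 2², low-order interleaving across 8 modules produces 6 conflicts, and the stride-matched mapping produces none. Because the odd factor is relatively prime to the power-of-two module count, successive elements step through every module once the power-of-2 part has been shifted out. A vector stored for one stride and accessed with another may still conflict under this sketch, which is the situation the abstract's buffered simulations address.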