Increasing the number of strides for conflict-free vector access
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Address transformation schemes, such as skewing and linear transformations, have been proposed to achieve conflict-free access for streams with constant stride. However, these schemes are conflict-free only for some strides. In this paper, we extend them to achieve conflict-free access for a larger number of strides. The basic idea is to access a stream of fixed length out of order. The stream is then stored in a local memory and used by subsequent instructions. This mode of operation is suitable for vector processors and for processors with decoupled access. The proposed scheme and mode of operation produce the largest possible number of conflict-free strides. Memory systems with any ratio between the number of memory modules and the memory latency are considered. The hardware for address calculation and access control is described and shown to be of similar complexity to that required for in-order access.
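The out-of-order idea can be illustrated with a small simulation. The sketch below is not the address transformation or the hardware described in the paper. It assumes a simple row-rotation skewing function, 8 memory modules, a module busy time of 8 cycles, and one request issued per cycle. All names (`module_of`, `stall_cycles`, `reorder_by_module`) are hypothetical. It shows only that a fixed-length stride-2 stream which stalls when requested in order can be requested without stalls once each block of requests is reordered.

```python
# Illustrative model only: the skewing function, M, and T are assumptions,
# not values or mappings taken from the paper.
M = 8  # number of memory modules (assumed)
T = 8  # cycles a module stays busy per access (assumed)


def module_of(addr, m=M):
    """Row-rotation skewing: each row of m words is rotated by its row index."""
    return (addr + addr // m) % m


def stall_cycles(addresses, m=M, t=T):
    """Issue one request per cycle; count cycles spent waiting on busy modules."""
    free_at = [0] * m  # first cycle at which each module can accept a request
    cycle = 0
    stalls = 0
    for addr in addresses:
        mod = module_of(addr, m)
        if free_at[mod] > cycle:
            stalls += free_at[mod] - cycle
            cycle = free_at[mod]
        free_at[mod] = cycle + t
        cycle += 1
    return stalls


def reorder_by_module(addresses, m=M):
    """Reorder each block of m requests so modules are visited in a fixed order.

    This removes conflicts only when every block touches m distinct modules,
    which holds for the stride-2 stream used below under the assumed skewing.
    """
    out = []
    for start in range(0, len(addresses), m):
        block = addresses[start:start + m]
        out.extend(sorted(block, key=lambda a: module_of(a, m)))
    return out


stride, length = 2, 64
in_order = [i * stride for i in range(length)]
out_of_order = reorder_by_module(in_order)

print("in-order stalls:    ", stall_cycles(in_order))
print("out-of-order stalls:", stall_cycles(out_of_order))
```

Because the elements arrive in a different order, a real design would write each one to its proper position in the local memory (for example, a vector register), so that subsequent instructions see the stream in its original order.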