An address mapping and an access order are presented for conflict-free access to vectors with any initial address and power-of-two strides. We show that such conflict-free access requires the memory to be unmatched (that is, to have more modules than the module latency in cycles, M > T). We present an implementation for M = 2T, where M is the number of modules and T is the module latency. Moreover, the implementation masks the latency of the address calculation, the mapper, and the bus arbiter.
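The sketch below shows why a dedicated mapping and access order are needed. It is an illustration under stated assumptions and is not the mapping proposed in the paper. It simulates in-order vector access to M = 2T modules under plain low-order interleaving (module = address mod M). The model assumes that at most one request issues per cycle and that each module stays busy for T cycles per access. The names `low_order`, `stall_cycles` and `vector` are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical type: a mapping takes (address, number_of_modules) and returns a module index.
Mapping = Callable[[int, int], int]


def low_order(addr: int, num_modules: int) -> int:
    """Baseline low-order interleaving (assumed): module = address mod M."""
    return addr % num_modules


def stall_cycles(addresses: Sequence[int], num_modules: int, latency: int,
                 mapping: Mapping = low_order) -> int:
    """Issue requests strictly in order, at most one per cycle.

    A module stays busy for `latency` cycles after accepting a request, so a
    request to a busy module waits and blocks every later request. Returns the
    number of cycles lost relative to one access per cycle.
    """
    busy_until = [0] * num_modules   # first cycle at which each module is free
    cycle = 0                        # earliest cycle the next request may issue
    for addr in addresses:
        module = mapping(addr, num_modules)
        start = max(cycle, busy_until[module])
        busy_until[module] = start + latency
        cycle = start + 1
    return cycle - len(addresses)


def vector(base: int, stride: int, length: int) -> list[int]:
    """Addresses of a vector with the given initial address and stride."""
    return [base + i * stride for i in range(length)]


if __name__ == "__main__":
    T = 4        # module latency in cycles (assumed value)
    M = 2 * T    # number of modules, M = 2T as in the abstract
    for stride in (1, 2, 4, 8):
        print(stride, stall_cycles(vector(0, stride, 64), M, T))
```

Take M = 8, T = 4 and 64-element vectors. Strides 1 and 2 run without stalls. Strides 4 and 8 lose 62 and 189 cycles, because they touch only 2 modules and 1 module respectively, which is fewer than T. Under in-order access with this baseline mapping, a stride is conflict-free only if it touches at least T distinct modules. You can pass any other mapping function with the same signature to compare schemes.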