This paper presents mathematical foundations for the design of a memory controller subcomponent that helps bridge the processor/memory performance gap for applications with strided access patterns. The Parallel Vector Access (PVA) unit exploits the regularity of vectors or streams to access them efficiently in parallel on a multi-bank SDRAM memory system. The PVA unit performs scatter/gather operations so that only the elements accessed by the application are transmitted across the system bus. Vector operations are broadcast in parallel to all memory banks, each of which implements an efficient algorithm to determine which vector elements it holds. Earlier performance evaluations have demonstrated that our PVA implementation loads elements up to 32.8 times faster than a conventional memory system and 3.3 times faster than a pipelined vector unit, without hurting the performance of normal cache-line fills. Here we present the underlying PVA algorithms for both word-interleaved and cache-line-interleaved memory systems.
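To make the per-bank decision concrete, the following is a minimal sketch of how a single bank in a word-interleaved system might work out which elements of a strided vector it holds. It is an illustration under stated assumptions, not the PVA algorithm from the paper: it assumes word-granularity addresses, that word address `a` lives in bank `a mod num_banks`, and it uses a textbook linear-congruence solution. The function name `elements_in_bank` and all parameter names are hypothetical.

```python
from math import gcd


def elements_in_bank(base, stride, length, num_banks, bank):
    """Return the element indices i in [0, length) that fall in `bank`.

    Assumption (not from the paper): word address `a` maps to bank
    `a % num_banks`, and element i lives at address base + i * stride.
    We therefore solve  stride * i == bank - base  (mod num_banks).
    """
    s = stride % num_banks
    d = (bank - base) % num_banks
    g = gcd(s, num_banks)          # gcd(0, n) == n, so stride % n == 0 is covered
    if d % g != 0:
        return []                  # this bank holds no element of the vector
    period = num_banks // g        # hits repeat every `period` elements
    if period == 1:
        first = 0                  # every element maps to this bank
    else:
        # Modular inverse via three-argument pow (Python 3.8+).
        first = ((d // g) * pow(s // g, -1, period)) % period
    return list(range(first, length, period))


if __name__ == "__main__":
    # Example: 16 elements, base word address 3, stride 6, 8 banks.
    for b in range(8):
        print(b, elements_in_bank(3, 6, 16, 8, b))
```

Because each bank can evaluate this independently from the same broadcast parameters (`base`, `stride`, `length`), no bank has to walk the whole vector to find its own elements. A cache-line-interleaved mapping would need a different bank function and is not covered by this sketch.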