A Simulation Study of the CRAY X-MP Memory System
IEEE Transactions on Computers
Introduction to algorithms
Data relocation and prefetching for programs with large data sets
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
STiNG: a CC-NUMA computer system for the commercial marketplace
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Synchronization and communication in the T3E multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Design and evaluation of dynamic access ordering hardware
ICS '96 Proceedings of the 10th international conference on Supercomputing
Active pages: a computation model for intelligent memory
Proceedings of the 25th annual international symposium on Computer architecture
Increasing TLB reach using superpages backed by shadow memory
Proceedings of the 25th annual international symposium on Computer architecture
A bandwidth-efficient architecture for media processing
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
The directory-based cache coherence protocol for the DASH multiprocessor
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Mapping irregular applications to DIVA, a PIM-based data-intensive architecture
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
IEEE Transactions on Computers
Using a user-level memory thread for correlation prefetching
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Pixel Processing in a Memory Controller
IEEE Computer Graphics and Applications
IEEE Micro
Compiling for the Impulse Memory Controller
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Cache Coherence Protocol Design for Active Memory Systems
PDPTA '02 Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications - Volume 1
Impulse: Building a Smarter Memory Controller
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Command Vector Memory Systems: High Performance at Low Cost
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Active Memory Techniques for ccNUMA Multiprocessors
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Reevaluating Online Superpage Promotion with Hardware Support
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Hi-index | 0.00 |
As processor performance continues to improve at a rate much higher than DRAM and network performance, we are approaching a time when large-scale distributed shared memory systems will have remote memory latencies measured in tens of thousands of processor cycles. The Impulse memory system architecture adds an optional level of address indirection at the memory controller. Applications can use this level of indirection to control how data is accessed and cached and thereby improve cache and bus utilization and reduce the number of memory accesses required. Previous Impulse work focuses on uniprocessor systems and relies on software to flush processor caches when necessary to ensure data coherence. In this paper, we investigate an extension of Impulse to multiprocessor systems that extends the coherence protocol to maintain data coherence without requiring software-directed cache flushing. Specifically, the multiprocessor Impulse controller can gather/scatter data across the network while its coherence protocol guarantees that each gather request gets coherent data and each scatter request updates every coherent replica in the system. Our simulation results demonstrate that the proposed system can significantly outperform conventional systems, achieving an average speedup of 9X on four memory-bound benchmarks on a 32-processor system.