Efficient address remapping in distributed shared-memory systems

Authors:
Lixin Zhang;Mike Parker;John Carter
Affiliations:
IBM Austin Research Lab, Austin, TX;Cray Inc.;University of Utah, Salt Lake City, UT
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2006

Citing 22
Cited 0

A Simulation Study of the CRAY X-MP Memory System

IEEE Transactions on Computers
Introduction to algorithms

Introduction to algorithms
Data relocation and prefetching for programs with large data sets

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
STiNG: a CC-NUMA computer system for the commercial marketplace

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Synchronization and communication in the T3E multiprocessor

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Design and evaluation of dynamic access ordering hardware

ICS '96 Proceedings of the 10th international conference on Supercomputing
Active pages: a computation model for intelligent memory

Proceedings of the 25th annual international symposium on Computer architecture
Increasing TLB reach using superpages backed by shadow memory

Proceedings of the 25th annual international symposium on Computer architecture
A bandwidth-efficient architecture for media processing

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
The directory-based cache coherence protocol for the DASH multiprocessor

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Mapping irregular applications to DIVA, a PIM-based data-intensive architecture

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
The Impulse Memory Controller

IEEE Transactions on Computers
Using a user-level memory thread for correlation prefetching

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Pixel Processing in a Memory Controller

IEEE Computer Graphics and Applications
Baring It All to Software: Raw Machines

Computer
A Case for Intelligent RAM

IEEE Micro
Compiling for the Impulse Memory Controller

Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Cache Coherence Protocol Design for Active Memory Systems

PDPTA '02 Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications - Volume 1
Impulse: Building a Smarter Memory Controller

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Command Vector Memory Systems: High Performance at Low Cost

PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Active Memory Techniques for ccNUMA Multiprocessors

IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Reevaluating Online Superpage Promotion with Hardware Support

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

As processor performance continues to improve at a rate much higher than DRAM and network performance, we are approaching a time when large-scale distributed shared memory systems will have remote memory latencies measured in tens of thousands of processor cycles. The Impulse memory system architecture adds an optional level of address indirection at the memory controller. Applications can use this level of indirection to control how data is accessed and cached and thereby improve cache and bus utilization and reduce the number of memory accesses required. Previous Impulse work focuses on uniprocessor systems and relies on software to flush processor caches when necessary to ensure data coherence. In this paper, we investigate an extension of Impulse to multiprocessor systems that extends the coherence protocol to maintain data coherence without requiring software-directed cache flushing. Specifically, the multiprocessor Impulse controller can gather/scatter data across the network while its coherence protocol guarantees that each gather request gets coherent data and each scatter request updates every coherent replica in the system. Our simulation results demonstrate that the proposed system can significantly outperform conventional systems, achieving an average speedup of 9X on four memory-bound benchmarks on a 32-processor system.