Moving Address Translation Closer to Memory in Distributed Shared-Memory Multiprocessors

Authors:
Xiaogang Qiu; Michel Dubois
Affiliations:
-;IEEE
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2005

Abstract

To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (Translation Lookaside Buffer), before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of Distributed Shared Memory (DSM) multiprocessors, including CC-NUMAs (Cache-Coherent Non-Uniform Memory Access Architectures) and COMAs (Cache-Only Memory Architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can migrate and replicate freely among nodes. When address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect of the caches above it. We also observe that the TLB is very effective when it is merged with the shared memory, because of the sharing and prefetching effects and because there is no need to maintain TLB consistency. Although a TLB merged with the shared memory is very effective, we also show that it can be removed altogether in a system with address translation done in memory, because the frequency of translations is very low.
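The filtering effect mentioned in the abstract can be illustrated with a toy model. The sketch below is not the authors' simulator or methodology. It assumes a single direct-mapped, virtually addressed cache, a hypothetical 64-byte block size, and a synthetic uniform-random trace. The names `count_translations` and `make_trace` are invented for illustration. It compares how many translations are needed when the TLB sits in front of the cache (every reference) with how many are needed when translation happens only for references that miss in the cache and reach memory.

```python
import random

BLOCK_SIZE = 64  # bytes per cache block (assumed value, not from the paper)


def count_translations(trace, num_sets):
    """Return (at_processor, at_memory) translation counts for a trace.

    at_processor: a TLB in front of the first-level cache translates
                  every reference.
    at_memory:    a virtually addressed, direct-mapped cache filters
                  references, so only misses reach the translation point.
    """
    tags = [None] * num_sets          # one virtual block number per set
    misses = 0
    for vaddr in trace:
        block = vaddr // BLOCK_SIZE
        index = block % num_sets
        if tags[index] != block:      # miss: the request goes to memory
            tags[index] = block
            misses += 1
    return len(trace), misses


def make_trace(n_refs, working_set_bytes, seed=0):
    """Hypothetical synthetic trace: uniform random within a working set."""
    rng = random.Random(seed)
    return [rng.randrange(working_set_bytes) for _ in range(n_refs)]


if __name__ == "__main__":
    # 32 KB working set against a 64 KB cache (1024 sets of 64 bytes).
    trace = make_trace(100_000, working_set_bytes=32 * 1024)
    at_cpu, at_mem = count_translations(trace, num_sets=1024)
    print(f"translations at processor: {at_cpu}")
    print(f"translations at memory:    {at_mem}")
```

In this toy setup the working set fits in the cache, so only cold misses require translation at memory, while a processor-side TLB is consulted on every reference. Real workloads, multi-level hierarchies, and coherence traffic would change the numbers, so the sketch only shows the direction of the effect.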