Inter-core cooperative TLB for chip multiprocessors

Authors:
Abhishek Bhattacharjee;Margaret Martonosi
Affiliations:
Princeton University, Princeton, NJ, USA;Princeton University, Princeton, NJ, USA
Venue:
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Year:
2010

Abstract

Translation Lookaside Buffers (TLBs) are commonly employed in modern processor designs and have considerable impact on overall system performance. A number of past works have studied TLB designs to lower access times and miss rates, specifically for uniprocessors. With the growing dominance of chip multiprocessors (CMPs), it is necessary to examine TLB performance in the context of parallel workloads. This work is the first to present TLB prefetchers that exploit commonality in TLB miss patterns across cores in CMPs. We propose and evaluate two Inter-Core Cooperative (ICC) TLB prefetching mechanisms, assessing their effectiveness at eliminating TLB misses both individually and together. Our results show these approaches require at most modest hardware and can collectively eliminate 19% to 90% of data TLB (D-TLB) misses across the surveyed parallel workloads. We also compare performance improvements across a range of hardware and software implementation possibilities. We find that while a fully-hardware implementation results in average performance improvements of 8-46% for a range of TLB sizes, a hardware/software approach yields improvements of 4-32%. Overall, our work shows that TLB prefetchers exploiting inter-core correlations can effectively eliminate TLB misses.
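The notion of "commonality in TLB miss patterns across cores" can be made concrete with a toy model. The sketch below is not the paper's ICC mechanisms, which the abstract does not describe in detail. It is a hypothetical illustration that assumes a globally ordered trace of D-TLB misses given as (core, virtual page) pairs, and a small per-core prefetch buffer that receives every page another core has missed on. The function name `shared_miss_fraction`, the `buffer_entries` parameter, and the 16-entry default are all assumptions made for illustration.

```python
from collections import deque


def shared_miss_fraction(misses, num_cores, buffer_entries=16):
    """Estimate how many TLB misses an idealized inter-core push could cover.

    misses: iterable of (core_id, virtual_page) tuples in global time order.
    Each miss on one core pushes the page into a FIFO prefetch buffer on
    every other core (a simplifying assumption, not the paper's design).
    A later miss that finds its page in the local buffer counts as covered.
    """
    # One bounded FIFO per core; the oldest entry is evicted when full.
    buffers = [deque(maxlen=buffer_entries) for _ in range(num_cores)]
    covered = 0
    total = 0
    for core, page in misses:
        total += 1
        if page in buffers[core]:
            covered += 1
            buffers[core].remove(page)  # entry moves into the TLB proper
        for other in range(num_cores):
            if other != core and page not in buffers[other]:
                buffers[other].append(page)
    return covered / total if total else 0.0


# Hypothetical two-core trace: core 1 re-misses on pages 10 and 11
# shortly after core 0 does, so 2 of the 6 misses are covered.
trace = [(0, 10), (1, 10), (0, 11), (1, 11), (1, 12), (0, 13)]
fraction = shared_miss_fraction(trace, num_cores=2)
```

On real traces the covered fraction would depend heavily on buffer size, timing, and how much data the threads actually share, which is why the abstract reports a wide range across workloads.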