Going the distance for TLB prefetching: an application-driven study

Authors:
Gokul B. Kandiraju;Anand Sivasubramaniam
Affiliations:
The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA
Venue:
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Year:
2002

References (20)

The interaction of architecture and operating system design
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems

A simulation based study of TLB performance
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture

Stride directed prefetching in scalar processors
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture

Design tradeoffs for software-managed TLBs
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture

Architectural support for translation table management in large address space machines
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture

Shade: a fast instruction-set simulator for execution profiling
SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems

A software-controlled prefetching mechanism for software-managed TLBs
Microprocessing and Microprogramming

Performance of the VAX-11/780 translation buffer: simulation and measurement
ACM Transactions on Computer Systems (TOCS)

Using the SimOS machine simulator to study complex computer systems
ACM Transactions on Modeling and Computer Simulation (TOMACS)

MediaBench: a tool for evaluating and synthesizing multimedia and communications systems
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture

Increasing TLB reach using superpages backed by shadow memory
Proceedings of the 25th annual international symposium on Computer architecture

A look at several memory management units, TLB-refill mechanisms, and page table organizations
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems

Prefetching Using Markov Predictors
IEEE Transactions on Computers - Special issue on cache memory and related problems

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture

Recency-based TLB preloading
Proceedings of the 27th annual international symposium on Computer architecture

Data prefetch mechanisms
ACM Computing Surveys (CSUR)

Characterizing the d-TLB behavior of SPEC CPU2000 benchmarks
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems

Effective Hardware-Based Data Prefetching for High-Performance Processors
IEEE Transactions on Computers

Neighborhood Prefetching on Multiprocessors Using Instruction History
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques

Use of superpages and subblocking in the address translation hierarchy

Cited by (26)

Characterizing the d-TLB behavior of SPEC CPU2000 benchmarks
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems

TCP: Tag Correlating Prefetchers
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture

Data Cache Prefetching Using a Global History Buffer
IEEE Micro

Efficient emulation of hardware prefetchers via event-driven helper threading
Proceedings of the 15th international conference on Parallel architectures and compilation techniques

Data prefetching in a cache hierarchy with high bandwidth and capacity
MEDEA '06 Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures

A comprehensive study of hardware/software approaches to improve TLB performance for java applications on embedded systems
Proceedings of the 2006 workshop on Memory system performance and correctness

Data prefetching in a cache hierarchy with high bandwidth and capacity
ACM SIGARCH Computer Architecture News

Accelerating two-dimensional page walks for virtualized systems
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems

Server-based data push architecture for multi-processor environments
Journal of Computer Science and Technology

Stream chaining: exploiting multiple levels of correlation in data prefetching
Proceedings of the 36th annual international symposium on Computer architecture

COMPASS: a programmable data prefetcher using idle GPU shaders
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems

Inter-core cooperative TLB for chip multiprocessors
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems

Chameleon: Virtualizing idle acceleration cores of a heterogeneous multicore processor for caching and prefetching
ACM Transactions on Architecture and Code Optimization (TACO)

Context-aware TLB preloading for interference reduction in embedded multi-tasked systems
Proceedings of the 20th symposium on Great lakes symposium on VLSI

Timing local streams: improving timeliness in data prefetching
Proceedings of the 24th ACM International Conference on Supercomputing

Translation caching: skip, don't walk (the page table)
Proceedings of the 37th annual international symposium on Computer architecture

Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture

SpecTLB: a mechanism for speculative address translation
Proceedings of the 38th annual international symposium on Computer architecture

Global-aware and multi-order context-based prefetching for high-performance processors
International Journal of High Performance Computing Applications

IOMMU: strategies for mitigating the IOTLB bottleneck
ISCA'10 Proceedings of the 2010 international conference on Computer Architecture

Boosting mobile GPU performance with a decoupled access/execute fragment processor
Proceedings of the 39th Annual International Symposium on Computer Architecture

TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs
ACM Transactions on Architecture and Code Optimization (TACO)

CoLT: Coalesced Large-Reach TLBs
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture

Efficient virtual memory for big memory servers
Proceedings of the 40th Annual International Symposium on Computer Architecture

Large-reach memory management unit caches
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

Abstract

The importance of the Translation Lookaside Buffer (TLB) to system performance is well known. Numerous prior efforts have addressed TLB design issues to cut access times and lower miss rates. However, only recently was the first exploration [26] of prefetching TLB entries ahead of their need undertaken, proposing a mechanism called Recency Prefetching. There is a large body of literature on prefetching for caches, but it is not clear how these techniques can be adapted for TLBs (or whether the issues differ), how well suited they are to TLB prefetching, and how they compare with the recency prefetching mechanism.

This paper presents the first detailed comparison of prefetching mechanisms previously proposed for caches, namely arbitrary stride prefetching and Markov prefetching, applied to TLB entries, and evaluates their pros and cons. In addition, this paper proposes a novel prefetching mechanism, called Distance Prefetching, that attempts to capture patterns in the reference behavior in a smaller space than earlier proposals. Using detailed simulations of a wide variety of applications (56 in all) from different benchmark suites and all the SPEC CPU2000 applications, this paper demonstrates the benefits of distance prefetching.
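
The abstract does not spell out how distance prefetching is organized, so the following is only a minimal sketch of the general idea, under stated assumptions: the predictor is indexed by the distance (difference in page numbers) between consecutive references rather than by the page itself, and each row remembers the distances that followed it most recently. The class name `DistancePrefetcher`, the parameters `table_entries` and `slots`, and the LRU row replacement are hypothetical choices made for illustration, not details taken from the paper.

```python
from collections import OrderedDict, deque


class DistancePrefetcher:
    """Toy distance-indexed predictor over a stream of virtual page numbers.

    Illustrative only: table size, slot count and LRU replacement are
    assumptions, not parameters from the paper.
    """

    def __init__(self, table_entries=64, slots=2):
        self.table = OrderedDict()  # distance -> recently observed next distances
        self.table_entries = table_entries
        self.slots = slots
        self.prev_page = None
        self.prev_distance = None

    def access(self, page):
        """Record a reference to `page`; return the pages predicted to follow."""
        predictions = []
        if self.prev_page is not None:
            distance = page - self.prev_page
            # Learn: the previous distance was followed by the current one.
            if self.prev_distance is not None:
                self._learn(self.prev_distance, distance)
            # Predict: apply the distances that followed this one in the past.
            for next_distance in self.table.get(distance, ()):
                predictions.append(page + next_distance)
            self.prev_distance = distance
        self.prev_page = page
        return predictions

    def _learn(self, key, next_distance):
        row = self.table.get(key)
        if row is None:
            if len(self.table) >= self.table_entries:
                self.table.popitem(last=False)  # evict the least recently used row
            row = deque(maxlen=self.slots)
            self.table[key] = row
        else:
            self.table.move_to_end(key)  # mark the row as recently used
        if next_distance in row:
            row.remove(next_distance)
        row.appendleft(next_distance)  # most recent prediction first


if __name__ == "__main__":
    prefetcher = DistancePrefetcher()
    for page in [1, 2, 4, 5, 7, 8]:
        print(page, prefetcher.access(page))
```

In this sketch, the page stream 1, 2, 4, 5, 7, 8 alternates between distances 1 and 2, so the table needs only two rows however many distinct pages are touched. A predictor indexed by page number would instead need a row per page, which is one way to read the abstract's "smaller space" claim.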