Translation Look-aside Buffers (TLBs) are vital hardware support for virtual memory management in high-performance computer systems and have a significant influence on overall system performance. Numerous techniques to reduce TLB miss latencies, along with the impact of TLB size, associativity, multilevel hierarchies, superpages, and prefetching, have been well studied in the context of uniprocessors. However, with Chip Multiprocessors (CMPs) becoming the standard design point of processor architectures, it is imperative to revisit the design and organization of TLBs in the context of CMPs. In this paper, we propose to improve system performance through a novel TLB organization called Synergistic TLBs. Synergistic TLBs differ from a per-core private TLB organization in three ways: (i) they provide capacity sharing by storing victim translations from one TLB in another, emulating a distributed shared TLB (DST); (ii) they support translation migration to maximize the utilization of TLB capacity; and (iii) they support translation replication to avoid excess latency for remote TLB accesses. We explore all the design points in this design space and find that an optimal point exists for high-performance address translation. Our evaluation with both multiprogrammed (SPEC 2006 applications) and multithreaded (PARSEC applications) workloads shows that Synergistic TLBs can eliminate 44.3% and 31.2% of the TLB misses, respectively, on average. They also improve the weighted speedup of multiprogrammed application mixes by 25.1% and the performance of multithreaded applications by 27.3%, on average.
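The three mechanisms named above (victim storing, migration, and replication) can be illustrated with a toy functional model. This is only a minimal sketch under stated assumptions, not the paper's design: the class names `CoreTLB` and `SynergisticTLBs`, the fully associative LRU organization, the "spill to the next core" victim placement, and the `replicate` switch are all hypothetical choices made for illustration, and no latencies are modeled.

```python
from collections import OrderedDict


class CoreTLB:
    """Toy fully associative LRU TLB (hypothetical model, not the paper's hardware)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # vpn -> ppn, least recently used first

    def lookup(self, vpn):
        """Return the cached ppn and refresh its LRU position, or None on a miss."""
        if vpn in self.entries:
            self.entries.move_to_end(vpn)
            return self.entries[vpn]
        return None

    def insert(self, vpn, ppn):
        """Insert a translation; return the evicted (vpn, ppn) victim, or None."""
        if vpn in self.entries:
            self.entries[vpn] = ppn
            self.entries.move_to_end(vpn)
            return None
        victim = None
        if len(self.entries) >= self.capacity:
            victim = self.entries.popitem(last=False)  # evict the LRU entry
        self.entries[vpn] = ppn
        return victim

    def remove(self, vpn):
        return self.entries.pop(vpn, None)


class SynergisticTLBs:
    """Per-core TLBs that store each other's victims (assumed policy choices)."""

    def __init__(self, num_cores, capacity, page_table, replicate=False):
        self.tlbs = [CoreTLB(capacity) for _ in range(num_cores)]
        self.page_table = page_table  # dict vpn -> ppn, stands in for a page walk
        self.replicate = replicate    # True: keep the remote copy; False: migrate it
        self.stats = {"local_hit": 0, "remote_hit": 0, "miss": 0}

    def _spill(self, core, victim):
        """Store a victim in the next core's TLB (assumed placement policy).

        Whatever the receiving TLB evicts is dropped, so spills never chain.
        """
        target = (core + 1) % len(self.tlbs)
        if victim is not None and target != core:
            self.tlbs[target].insert(*victim)

    def translate(self, core, vpn):
        local = self.tlbs[core]
        ppn = local.lookup(vpn)
        if ppn is not None:
            self.stats["local_hit"] += 1
            return ppn
        for other, tlb in enumerate(self.tlbs):
            if other == core:
                continue
            ppn = tlb.lookup(vpn)
            if ppn is not None:
                self.stats["remote_hit"] += 1
                if not self.replicate:
                    tlb.remove(vpn)  # migration: the translation moves to the requester
                self._spill(core, local.insert(vpn, ppn))
                return ppn
        self.stats["miss"] += 1
        ppn = self.page_table[vpn]  # full miss: fall back to the page table
        self._spill(core, local.insert(vpn, ppn))
        return ppn


if __name__ == "__main__":
    table = {v: v + 100 for v in range(8)}
    system = SynergisticTLBs(num_cores=2, capacity=2, page_table=table)
    for page in (0, 1, 2, 0, 0):
        system.translate(0, page)
    print(system.stats)
```

In this sketch, a translation evicted from core 0 lands in core 1's TLB, so a later access from core 0 is served as a remote hit instead of a page walk. With `replicate=False` the entry migrates back to the requester; with `replicate=True` both TLBs keep a copy, trading capacity for fewer remote accesses.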