The impact of architectural trends on operating system performance
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Prefetching using Markov predictors
Proceedings of the 24th annual international symposium on Computer architecture
Options for dynamic address translation in COMAs
Proceedings of the 25th annual international symposium on Computer architecture
A look at several memory management units, TLB-refill mechanisms, and page table organizations
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Proceedings of the 27th annual international symposium on Computer architecture
Characterizing the d-TLB behavior of SPEC CPU2000 benchmarks
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Going the distance for TLB prefetching: an application-driven study
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Effective Hardware-Based Data Prefetching for High-Performance Processors
IEEE Transactions on Computers
IEEE Transactions on Computers
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Techniques for Multicore Thermal Management: Classification and New Exploration
Proceedings of the 33rd annual international symposium on Computer Architecture
Subsetting the SPEC CPU2006 benchmark suite
ACM SIGARCH Computer Architecture News
SPEC CPU2006 sensitivity to memory page sizes
ACM SIGARCH Computer Architecture News
Thermal-aware task scheduling at the system software level
ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Thread motion: fine-grained power management for multi-core systems
Proceedings of the 36th annual international symposium on Computer architecture
Characterizing the TLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors
PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
TMT - A TLB Tag Management Framework for Virtualized Platforms
SBAC-PAD '09 Proceedings of the 2009 21st International Symposium on Computer Architecture and High Performance Computing
qTLB: looking inside the look-aside buffer
HiPC'07 Proceedings of the 14th international conference on High performance computing
Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Efficient virtual memory for big memory servers
Proceedings of the 40th Annual International Symposium on Computer Architecture
Hi-index | 0.00 |
Translation Lookaside Buffers (TLBs) are critical to overall system performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as Chip MultiProcessors (CMPs) become ubiquitous, TLB design and performance must be reevaluated. Our article begins by performing a thorough TLB performance evaluation of sequential and parallel benchmarks running on a real-world, modern CMP system using hardware performance counters. This analysis demonstrates the need for further improvement of TLB hit rates for both classes of application, and it also points out that the data TLB has a significantly higher miss rate than the instruction TLB in both cases. In response to the characterization data, we propose and evaluate both Inter-Core Cooperative (ICC) TLB prefetchers and Shared Last-Level (SLL) TLBs as alternatives to the commercial norm of private, per-core L2 TLBs. ICC prefetchers eliminate 19% to 90% of Data TLB (D-TLB) misses across parallel workloads while requiring only modest changes in hardware. SLL TLBs eliminate 7% to 79% of D-TLB misses for parallel workloads and 35% to 95% of D-TLB misses for multiprogrammed sequential workloads. This corresponds to 27% and 21% increases in hit rates as compared to private, per-core L2 TLBs, respectively, and is achieved this using even more modest hardware requirements. Because of their benefits for parallel applications, their applicability to sequential workloads, and their readily implementable hardware, SLL TLBs and ICC TLB prefetchers hold great promise for CMPs.