Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer

Authors:
Livio Soares;David Tam;Michael Stumm
Affiliations:
Department of Electrical and Computer Engineering, University of Toronto, Canada;Department of Electrical and Computer Engineering, University of Toronto, Canada;Department of Electrical and Computer Engineering, University of Toronto, Canada
Venue:
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Year:
2008

Citing 24
Cited 28

Multiprocessor cache analysis using ATUM

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Page placement algorithms for large real-indexed caches

ACM Transactions on Computer Systems (TOCS)
The effect of page allocation on caches

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Software-based cache partitioning for real-time applications

Journal of Computer and Software Engineering - Special issue: hardware-software codesign
Compiler support for software-based cache partitioning

LCTES '95 Proceedings of the ACM SIGPLAN 1995 workshop on Languages, compilers, & tools for real-time systems
A data cache with multiple caching strategies tuned to different types of locality

ICS '95 Proceedings of the 9th international conference on Supercomputing
A modified approach to data cache management

Proceedings of the 28th annual international symposium on Microarchitecture
Compiler-directed page coloring for multiprocessors

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Continuous profiling: where have all the cycles gone?

ACM Transactions on Computer Systems (TOCS)
ProfileMe: hardware support for instruction-level profiling on out-of-order processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Reducing cache misses using hardware and software page placement

ICS '99 Proceedings of the 13th international conference on Supercomputing
Run-Time Cache Bypassing

IEEE Transactions on Computers
Compiler managed micro-cache bypassing for high performance EPIC processors

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
OS-Controlled Cache Predictability for Real-Time Systems

RTAS '97 Proceedings of the 3rd IEEE Real-Time Technology and Applications Symposium (RTAS '97)
Data Caches in Multitasking Hard Real-Time Systems

RTSS '03 Proceedings of the 24th IEEE International Real-Time Systems Symposium
Dynamic Partitioning of Shared Cache Memory

The Journal of Supercomputing
Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Fast data-locality profiling of native execution

SIGMETRICS '05 Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive insertion policies for high performance caching

Proceedings of the 34th annual international symposium on Computer architecture
A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Counter-Based Cache Replacement and Bypassing Algorithms

IEEE Transactions on Computers
Exploiting Single-Usage for Effective Memory Management

ACSAC '07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Enhancing last-level cache performance by block bypassing and early miss determination

ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture

RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Towards practical page coloring-based multicore cache management

Proceedings of the 4th ACM European conference on Computer systems
Enhancing operating system support for multicore processors by using hardware performance monitoring

ACM SIGOPS Operating Systems Review
MCC-DB: minimizing cache conflicts in multi-core processors for databases

Proceedings of the VLDB Endowment
Contention aware execution: online contention detection and response

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Synthesizing contention

Proceedings of the Workshop on Binary Instrumentation and Applications
Hardware execution throttling for multi-core resource management

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Generalized ERSS tree model: Revisiting working sets

Performance Evaluation
STEM: Spatiotemporal Management of Capacity for Intra-core Last Level Caches

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
ULCC: a user-level facility for optimizing shared cache performance on multicores

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Directly characterizing cross core interference through contention synthesis

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
SRM-buffer: an OS buffer management technique to prevent last level cache from thrashing in multicores

Proceedings of the sixth conference on Computer systems
Mind the gap: reconnecting architecture and OS research

HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Controlling cache utilization of HPC applications

Proceedings of the international conference on Supercomputing
Contentiousness vs. sensitivity: improving contention aware runtime systems on multicore architectures

Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Loaf: a framework and infrastructure for creating online adaptive solutions

Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Page coloring synchronization for improving cache performance in virtualization environment

ICCSA'11 Proceedings of the 2011 international conference on Computational science and its applications - Volume Part III
Cuanta: quantifying effects of shared on-chip resource interference for consolidated virtual machines

Proceedings of the 2nd ACM Symposium on Cloud Computing
Improving shared cache behavior of multithreaded object-oriented applications in multicores

Proceedings of the International Conference on Computer-Aided Design
Region scheduling: efficiently using the cache architectures via page-level affinity

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Code-based cache partitioning for improving hardware cache performance

Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication
Affinity-aware DMA buffer management for reducing off-chip memory access

Proceedings of the 27th Annual ACM Symposium on Applied Computing
Compiling for niceness: mitigating contention for QoS in warehouse scale computers

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Reducing last level cache pollution in NUMA multicore systems for improving cache performance

ICCSA'12 Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part III
Survey of scheduling techniques for addressing shared resources in multicore processors

ACM Computing Surveys (CSUR)
Accurate prediction of the behavior of multithreaded applications in shared caches

Parallel Computing
To hardware prefetch or not to prefetch?: a virtualized environment study and core binding approach

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
ReSense: Mapping dynamic workloads of colocated multithreaded applications using resource sensitivity

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

It is well recognized that LRU cache-line replacement can be ineffective for applications with large working sets or non-localized memory access patterns. Specifically, in last-level processor caches, LRU can cause cache pollution by inserting non-reuseable elements into the cache while evicting reusable ones. The work presented in this paper addresses last-level cache pollution through a dynamic operating system mechanism, called ROCS, requiring no change to underlying hardware and no change to applications. ROCS employs hardware performance counters on a commodity processor to characterize application cache behavior at run-time. Using this online profiling, cache unfriendly pages are dynamically mapped to a pollute buffer in the cache, eliminating competition between reusable and non-reusable cache lines. The operating system implements the pollute buffer through a page-coloring based technique, by dedicating a small slice of the last-level cache to store non-reusable pages. Measurements show that ROCS, implemented in the Linux 2.6.24 kernel and running on a 2.3GHz PowerPC 970FX, improves performance of memory-intensive SPEC CPU 2000 and NAS benchmarks by up to 34%, and 16% on average.