It is well recognized that LRU cache-line replacement can be ineffective for applications with large working sets or non-localized memory access patterns. Specifically, in last-level processor caches, LRU can cause cache pollution by inserting non-reusable elements into the cache while evicting reusable ones. The work presented in this paper addresses last-level cache pollution through a dynamic operating system mechanism, called ROCS, that requires no changes to the underlying hardware or to applications. ROCS employs hardware performance counters on a commodity processor to characterize application cache behavior at run-time. Using this online profiling, cache-unfriendly pages are dynamically mapped to a pollute buffer in the cache, eliminating competition between reusable and non-reusable cache lines. The operating system implements the pollute buffer through a page-coloring-based technique, dedicating a small slice of the last-level cache to non-reusable pages. Measurements show that ROCS, implemented in the Linux 2.6.24 kernel and running on a 2.3GHz PowerPC 970FX, improves the performance of memory-intensive SPEC CPU 2000 and NAS benchmarks by up to 34%, and by 16% on average.
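
The page-coloring idea behind the pollute buffer can be illustrated with a minimal Python sketch. It is not the ROCS implementation: the cache geometry, the miss-rate threshold, the choice of a single pollute color, and every name (`page_color`, `PageProfile`, `is_cache_unfriendly`, `choose_frame`) are assumptions made for illustration. The sketch only shows how a physical frame's color confines a page to one slice of a physically indexed cache, and how a profile-driven allocator might steer cache-unfriendly pages into the slice reserved as the pollute buffer.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Illustrative cache geometry (assumed values, not taken from the abstract).
PAGE_SIZE = 4096            # bytes per page
CACHE_SIZE = 512 * 1024     # bytes in the last-level cache
ASSOCIATIVITY = 8           # ways per set

# Pages with the same color compete for the same group of cache sets.
NUM_COLORS = CACHE_SIZE // (ASSOCIATIVITY * PAGE_SIZE)

# Hypothetical policy: reserve the last color as the pollute buffer.
POLLUTE_COLORS = {NUM_COLORS - 1}

# Hypothetical threshold: pages missing at least this often are "unfriendly".
MISS_RATE_THRESHOLD = 0.9


def page_color(pfn: int) -> int:
    """Color of a physical frame: the low bits of the frame number that
    overlap the cache set index."""
    return pfn % NUM_COLORS


@dataclass
class PageProfile:
    """Per-page counts, standing in for what online profiling would report."""
    vpn: int
    accesses: int
    misses: int


def is_cache_unfriendly(profile: PageProfile,
                        threshold: float = MISS_RATE_THRESHOLD) -> bool:
    """Classify a page as non-reusable when its miss rate is very high."""
    if profile.accesses == 0:
        return False
    return profile.misses / profile.accesses >= threshold


def choose_frame(profile: PageProfile,
                 free_frames: Iterable[int]) -> Optional[int]:
    """Pick the first free frame whose color matches the page's class:
    a pollute color for unfriendly pages, any other color otherwise."""
    want_pollute = is_cache_unfriendly(profile)
    for pfn in free_frames:
        if (page_color(pfn) in POLLUTE_COLORS) == want_pollute:
            return pfn
    return None  # no suitable frame; a real allocator would need a fallback
```

Under these assumed parameters the cache has 16 colors, so reserving one color dedicates one sixteenth of the cache to non-reusable pages; remapping a page then amounts to copying it into a frame of the chosen color and updating its page-table entry.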