Architectural support for operating system-driven CMP cache management

Authors:
Nauman Rafique;Won-Taek Lim;Mithuna Thottethodi
Affiliations:
Purdue University, West Lafayette, IN;Purdue University, West Lafayette, IN;Purdue University, West Lafayette, IN
Venue:
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Year:
2006

Citing 22
Cited 72

The VMP multiprocessor: initial experience, refinements, and performance evaluation

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Hardware-software trade-offs in a direct Rambus implementation of the RAMpage memory hierarchy

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
A fully associative software-managed cache design

Proceedings of the 27th annual international symposium on Computer architecture
Operating System Concepts

Operating System Concepts
Simics: A Full System Simulation Platform

Computer
Software-Managed Address Translation

HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Generating Representative Web Workloads for Network and Server Performance Evaluation

Generating Representative Web Workloads for Network and Server Performance Evaluation
A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Introduction: Service-oriented computing

Communications of the ACM - Service-oriented computing
Dynamic Partitioning of Shared Cache Memory

The Journal of Supercomputing
CQoS: a framework for enabling QoS in shared caches of CMP platforms

Proceedings of the 18th annual international conference on Supercomputing
Utility computing SLA management based upon business objectives

IBM Systems Journal
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
SLA based profit optimization in autonomic computing systems

Proceedings of the 2nd international conference on Service oriented computing
A Performance Comparison of DRAM Memory System Optimizations for SMT Processors

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Enterprise IT Trends and Implications for Architecture Research

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Niagara: A 32-Way Multithreaded Sparc Processor

IEEE Micro
The V-Way Cache: Demand Based Associativity via Global Replacement

Proceedings of the 32nd annual international symposium on Computer Architecture
Organizing the Last Line of Defense before Hitting the Memory Wall for CMPs

HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
Fast and fair: data-stream quality of service

Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
The Future of Microprocessors

Queue - Multiprocessors
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05

From chaos to QoS: case studies in CMP resource management

ACM SIGARCH Computer Architecture News
Virtual hierarchies to support server consolidation

Proceedings of the 34th annual international symposium on Computer architecture
Virtual private caches

Proceedings of the 34th annual international symposium on Computer architecture
QoS policies and architecture for cache/memory in CMP platforms

Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Cooperative cache partitioning for chip multiprocessors

Proceedings of the 21st annual international conference on Supercomputing
Adaptive set pinning: managing shared caches in chip multiprocessors

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Memory performance attacks: denial of memory service in multi-core systems

SS'07 Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium
Memory hierarchy performance measurement of commercial dual-core desktop processors

Journal of Systems Architecture: the EUROMICRO Journal
A novel migration-based NUCA design for chip multiprocessors

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
PAM: a novel performance/power aware meta-scheduler for multi-core systems

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Analysis and approximation of optimal co-scheduling on chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Towards practical page coloring-based multicore cache management

Proceedings of the 4th ACM European conference on Computer systems
Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
A light-weight fairness mechanism for chip multiprocessor memory systems

Proceedings of the 6th ACM conference on Computing frontiers
A study on optimally co-scheduling jobs of different lengths on chip multiprocessors

Proceedings of the 6th ACM conference on Computing frontiers
Enhancing operating system support for multicore processors by using hardware performance monitoring

ACM SIGOPS Operating Systems Review
FlexDCP: a QoS framework for CMP architectures

ACM SIGOPS Operating Systems Review
Rate-based QoS techniques for cache/memory in CMP platforms

Proceedings of the 23rd international conference on Supercomputing
Service level agreement for multithreaded processors

ACM Transactions on Architecture and Code Optimization (TACO)
PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches

Proceedings of the 36th annual international symposium on Computer architecture
A case for integrated processor-cache partitioning in chip multiprocessors

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Enabling software management for multicore caches with a lightweight hardware support

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
VM3: Measuring, modeling and managing VM shared resources

Computer Networks: The International Journal of Computer and Telecommunications Networking
Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Optimizing shared cache behavior of chip multiprocessors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
SHARP control: controlled shared cache management in chip multiprocessors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Micro-pages: increasing DRAM efficiency with locality-aware data placement

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
COMPASS: a programmable data prefetcher using idle GPU shaders

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Contention aware execution: online contention detection and response

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
PIRATE: QoS and performance management in CMP architectures

ACM SIGMETRICS Performance Evaluation Review
MLP-aware dynamic cache partitioning

HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
Synthesizing contention

Proceedings of the Workshop on Binary Instrumentation and Applications
Cache topology aware computation mapping for multicores

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
ScaleUPC: a UPC compiler for multi-core systems

Proceedings of the Third Conference on Partitioned Global Address Space Programing Models
Off-chip memory bandwidth minimization through cache partitioning for multi-core platforms

Proceedings of the 47th Design Automation Conference
Handling the problems and opportunities posed by multiple on-chip memory controllers

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Quality of service shared cache management in chip multiprocessor architecture

ACM Transactions on Architecture and Code Optimization (TACO)
Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Online cache modeling for commodity multicore processors

ACM SIGOPS Operating Systems Review
Directly characterizing cross core interference through contention synthesis

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Replacement policies for shared caches on symmetric multicores: a programmer-centric point of view

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Adaptive timekeeping replacement: Fine-grained capacity management for shared CMP caches

ACM Transactions on Architecture and Code Optimization (TACO)
CoQoS: Coordinating QoS-aware shared resources in NoC-based SoCs

Journal of Parallel and Distributed Computing
Dynamic cache partitioning based on the MLP of cache misses

Transactions on high-performance embedded architectures and compilers III
METE: meeting end-to-end QoS in multicores through system-wide resource management

Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Studying inter-core data reuse in multicores

Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Prefetch-aware shared resource management for multi-core systems

Proceedings of the 38th annual international symposium on Computer architecture
The impact of memory subsystem resource sharing on datacenter applications

Proceedings of the 38th annual international symposium on Computer architecture
Contentiousness vs. sensitivity: improving contention aware runtime systems on multicore architectures

Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Loaf: a framework and infrastructure for creating online adaptive solutions

Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
METE: meeting end-to-end QoS in multicores through system-wide resource management

ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Studying inter-core data reuse in multicores

ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Evaluating placement policies for managing capacity sharing in CMP architectures with private caches

ACM Transactions on Architecture and Code Optimization (TACO)
A helper thread based dynamic cache partitioning scheme for multithreaded applications

Proceedings of the 48th Design Automation Conference
Multilayer cache partitioning for multiprogram workloads

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Cuanta: quantifying effects of shared on-chip resource interference for consolidated virtual machines

Proceedings of the 2nd ACM Symposium on Cloud Computing
The gradient-based cache partitioning algorithm

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Improving shared cache behavior of multithreaded object-oriented applications in multicores

Proceedings of the International Conference on Computer-Aided Design
Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Combining locality analysis with online proactive job co-scheduling in chip multiprocessors

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Scalable shared-cache management by containing thrashing workloads

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
SRP: symbiotic resource partitioning of the memory hierarchy in CMPs

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Is reuse distance applicable to data locality analysis on chip multiprocessors?

CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Courteous cache sharing: being nice to others in capacity management

Proceedings of the 49th Annual Design Automation Conference
Probabilistic shared cache management (PriSM)

Proceedings of the 39th Annual International Symposium on Computer Architecture
Dynamic QoS management for chip multiprocessors

ACM Transactions on Architecture and Code Optimization (TACO)
A collaborative memory system for high-performance and cost-effective clustered architectures

Proceedings of the 1st Workshop on Architectures and Systems for Big Data
Survey of scheduling techniques for addressing shared resources in multicore processors

ACM Computing Surveys (CSUR)
A Machine Learning Based Meta-Scheduler for Multi-Core Processors

International Journal of Adaptive, Resilient and Autonomic Systems
PCASA: probabilistic control-adjusted selective allocation for shared caches

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe

Quantified Score

Hi-index	0.00

Visualization

Abstract

The role of the operating system (OS) in managing shared resources such as CPU time, memory, peripherals, and even energy is well motivated and understood [23]. Unfortunately, one key resource—lower-level shared cache in chip multi-processors—is commonly managed purely in hardware by rudimentary replacement policies such as least-recentlyused (LRU). The rigid nature of the hardware cache management policy poses a serious problem since there is no single best cache management policy across all sharing scenarios. For example, the cache management policy for a scenario where applications from a single organization are running under "best effort" performance expectation is likely to be different from the policy for a scenario where applications from competing business entities (say, at a third party data center) are running under a minimum service level expectation. When it comes to managing shared caches, there is an inherent tension between flexibility and performance. On one hand, managing the shared cache in the OS offers immense policy flexibility since it may be implemented in software. Unfortunately, it is prohibitively expensive in terms of performance for the OS to be involved in managing temporally fine-grain events such as cache allocation. On the other hand, sophisticated hardware-only cache management techniques to achieve fair sharing or throughput maximization have been proposed. But they offer no policy flexibility.This paper addresses this problem by designing architectural support for OS to efficiently manage shared caches with a wide variety of policies. Our scheme consists of a hardware cache quota management mechanism, an OS interface and a set of OS level quota orchestration policies. The hardware mechanism guarantees that OS-specified quotas are enforced in shared caches, thus eliminating the need for (and the performance penalty of) temporally fine-grained OS intervention. The OS retains policy flexibility since it can tune the quotas during regularly scheduled OS interventions. We demonstrate that our scheme can support a wide range of policies including policies that provide (a) passive performance differentiation, (b) reactive fairness by miss-rate equalization and (c) reactive performance differentiation.