Ubik: efficient cache sharing with strict QoS for latency-critical workloads

Authors:
Harshad Kasture; Daniel Sanchez
Affiliations:
Massachusetts Institute of Technology, Cambridge, MA, USA; Massachusetts Institute of Technology, Cambridge, MA, USA
Venue:
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Year:
2014

Abstract

Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, and have inherently low utilization. On the other hand, compute-intensive batch applications (e.g., MapReduce) only need high long-term average performance. In current CMPs, latency-critical and batch applications cannot run concurrently due to interference on shared resources. Unfortunately, prior work on quality of service (QoS) in CMPs has focused on guaranteeing average performance, not tail latency. In this work, we analyze several latency-critical workloads, and show that guaranteeing average performance is insufficient to maintain low tail latency, because microarchitectural resources with state, such as caches or cores, exert inertia on instantaneous workload performance. Last-level caches impart the highest inertia, as workloads take tens of milliseconds to warm them up. When left unmanaged, or when managed with conventional QoS frameworks, shared last-level caches degrade tail latency significantly. Instead, we propose Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications. Using extensive simulations, we show that, while conventional QoS frameworks degrade tail latency by up to 2.3x, Ubik simultaneously maintains the tail latency of latency-critical workloads and significantly improves the performance of batch applications.
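
The abstract's distinction between average performance and tail latency can be illustrated with a minimal sketch. The latency values below are synthetic and invented purely for illustration (they are not measurements from the paper), and the helper names `mean_latency` and `tail_latency` are hypothetical. The sketch compares two request traces with the same mean latency, one of which has a small fraction of slow requests, such as those that might be served while a cache is still warming up.

```python
def mean_latency(samples_us):
    """Arithmetic mean of latency samples (microseconds)."""
    return sum(samples_us) / len(samples_us)


def tail_latency(samples_us, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it.
    pct is an integer in 1..100; integer math avoids rounding surprises."""
    ordered = sorted(samples_us)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100)
    return ordered[rank - 1]


# Synthetic traces, in microseconds (assumed values, for illustration only).
# 'steady': every request takes 2 ms.
steady = [2000] * 100
# 'bursty': 95 requests are slightly faster, 5 are much slower.
bursty = [1800] * 95 + [5800] * 5

for name, trace in (("steady", steady), ("bursty", bursty)):
    print(name, "mean:", mean_latency(trace), "us, p99:", tail_latency(trace, 99), "us")
```

Both traces have a mean of 2000 us, yet the 99th-percentile latency of the second trace is 5800 us. A guarantee stated only in terms of the average would treat the two traces as equivalent, which is the gap the abstract points to.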