Compiling for niceness: mitigating contention for QoS in warehouse scale computers

Authors:
Lingjia Tang;Jason Mars;Mary Lou Soffa
Affiliations:
University of Virginia;University of Virginia;University of Virginia
Venue:
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Year:
2012

Citing 37
Cited 6

Web Search for a Planet: The Google Cluster Architecture

IEEE Micro
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Enterprise IT Trends and Implications for Architecture Research

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Pin: building customized program analysis tools with dynamic instrumentation

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Fair Queuing Memory Systems

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
QoS policies and architecture for cache/memory in CMP platforms

Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler

PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
A Framework for Providing Quality of Service in Chip Multi-Processors

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive set pinning: managing shared caches in chip multiprocessors

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
PAM: a novel performance/power aware meta-scheduler for multi-core systems

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Using OS Observations to Improve Performance in Multicore Systems

IEEE Micro
Analysis and approximation of optimal co-scheduling on chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A compiler-directed data prefetching scheme for chip multiprocessors

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
PowerNap: eliminating server idle power

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Rate-based QoS techniques for cache/memory in CMP platforms

Proceedings of the 23rd international conference on Supercomputing
Scenario Based Optimization: A Framework for Statically Enabling Online Optimizations

Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
Optimizing shared cache behavior of chip multiprocessors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Addressing shared resource contention in multicore processors via scheduling

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Q-clouds: managing performance interference effects for QoS-aware clouds

Proceedings of the 5th European conference on Computer systems
Contention aware execution: online contention detection and response

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Cache topology aware computation mapping for multicores

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
An approach to resource-aware co-scheduling for CMPs

Proceedings of the 24th ACM International Conference on Supercomputing
On mitigating memory bandwidth contention through bandwidth-aware scheduling

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Hardware execution throttling for multi-core resource management

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
All-window profiling and composable models of cache sharing

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Dynamic cache contention detection in multi-threaded applications

Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
The impact of memory subsystem resource sharing on datacenter applications

Proceedings of the 38th annual international symposium on Computer architecture
Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Combining locality analysis with online proactive job co-scheduling in chip multiprocessors

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Automated locality optimization based on the reuse distance of string operations

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
MAO -- An extensible micro-architectural optimizer

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization

ReQoS: reactive static/dynamic compilation for QoS in warehouse scale computers

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers

Proceedings of the 40th Annual International Symposium on Computer Architecture
Whare-map: heterogeneity in "homogeneous" warehouse-scale computers

Proceedings of the 40th Annual International Symposium on Computer Architecture
GoldRush: resource efficient in situ scientific data analytics using fine-grained interference aware execution

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Librando: transparent code randomization for just-in-time compilers

Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security
An empirical model for predicting cross-core performance interference on multicore processors

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.00

Visualization

Abstract

As the class of datacenters recently coined as warehouse scale computers (WSCs) continues to leverage commodity multicore processors with increasing core counts, there is a growing need to consolidate various workloads on these machines to fully utilize their computation power. However, it is well known that when multiple applications are co-located on a multicore machine, contention for shared memory resources can cause severe cross-core performance interference. To ensure that the quality of service (QoS) of user-facing applications does not suffer from performance interference, WSC operators resort to disallowing co-location of latency-sensitive applications with other applications. This policy translates to low machine utilization and millions of dollars wasted in WSCs. This paper presents QoS-Compile, the first compilation approach that statically manipulates application contentiousness to enable the co-location of applications with varying QoS requirements, and as a result, can greatly improve machine utilization. Our technique first pinpoints an application's code regions that tend to cause contention and performance interference. QoS-Compile then transforms those regions to reduce their contentious nature. In essence, to co-locate applications of different QoS priorities, our compilation technique uses pessimizing transformations to throttle down the memory access rate of the contentious regions in low priority applications to reduce their interference to high priority applications. Our evaluation using synthetic benchmarks, SPEC benchmarks and large-scale Google applications show that QoS-Compile can greatly reduce contention, improve QoS of applications, and improve machine utilization. Our experiments show that our technique improves applications' QoS performance by 21% and machine utilization by 36% on average.