An empirical model for predicting cross-core performance interference on multicore processors

Authors:
Jiacheng Zhao;Huimin Cui;Jingling Xue;Xiaobing Feng;Youliang Yan;Wensen Yang
Affiliations:
Institute of Computing Technology, CAS & University of Chinese Academy of Sciences, Beijing, China;Institute of Computing Technology, CAS, Beijing, China;University of New South Wales, Sydney, Australia;Institute of Computing Technology, CAS, Beijing, China;Huawei Technologies Co., Ltd., Shenzhen, China;Huawei Technologies Co., Ltd., Shenzhen, China
Venue:
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Year:
2013

Citing 41
Cited 0

Data Caches in Multitasking Hard Real-Time Systems

RTSS '03 Proceedings of the 24th IEEE International Real-Time Systems Symposium
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
A Case for MLP-Aware Cache Replacement

Proceedings of the 33rd annual international symposium on Computer Architecture
An analytical model for cache replacement policy performance

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Scheduling threads for constructive cache sharing on CMPs

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Cooperative cache partitioning for chip multiprocessors

Proceedings of the 21st annual international conference on Supercomputing
Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler

PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Data cache locking for tight timing calculations

ACM Transactions on Embedded Computing Systems (TECS)
The Case for Energy-Proportional Computing

Computer
Cache modeling in probabilistic execution time analysis

Proceedings of the 45th annual Design Automation Conference
Multicore Resource Management

IEEE Micro
Using OS Observations to Improve Performance in Multicore Systems

IEEE Micro
The PARSEC benchmark suite: characterization and architectural implications

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Analysis and approximation of optimal co-scheduling on chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Towards practical page coloring-based multicore cache management

Proceedings of the 4th ACM European conference on Computer systems
PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches

Proceedings of the 36th annual international symposium on Computer architecture
Predicting cache needs and cache sensitivity for applications in cloud computing on CMP servers with configurable caches

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
A case for integrated processor-cache partitioning in chip multiprocessors

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Evaluation techniques for storage hierarchies

IBM Systems Journal
Addressing shared resource contention in multicore processors via scheduling

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Level by level: making flow- and context-sensitive pointer analysis scalable for millions of lines of code

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Contention aware execution: online contention detection and response

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
On mitigating memory bandwidth contention through bandwidth-aware scheduling

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Get the parallelism out of my cloud

HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Directly characterizing cross core interference through contention synthesis

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
The impact of memory subsystem resource sharing on datacenter applications

Proceedings of the 38th annual international symposium on Computer architecture
Contentiousness vs. sensitivity: improving contention aware runtime systems on multicore architectures

Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Automatic Library Generation for BLAS3 on GPUs

IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Combining locality analysis with online proactive job co-scheduling in chip multiprocessors

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Extendable pattern-oriented optimization directives

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Providing fairness on shared-memory multiprocessors via process scheduling

Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Compiling for niceness: mitigating contention for QoS in warehouse scale computers

Proceedings of the Tenth International Symposium on Code Generation and Optimization
On-demand dynamic summary-based points-to analysis

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Extendable pattern-oriented optimization directives

ACM Transactions on Architecture and Code Optimization (TACO)
Survey of scheduling techniques for addressing shared resources in multicore processors

ACM Computing Surveys (CSUR)
Query-directed adaptive heap cloning for optimizing compilers

CGO '13 Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
Defensive loop tiling for shared cache

CGO '13 Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Despite their widespread adoption in cloud computing, multicore processors are heavily under-utilized in terms of computing resources. To avoid the potential for negative and unpredictable interference, co-location of a latency-sensitive application with others on the same multicore processor is disallowed, leaving many cores idle and causing low machine utilization. To enable co-location while providing QoS guarantees, it is challenging but important to predict performance interference between co-located applications. This research is driven by two key insights. First, the performance degradation of an application can be represented as a predictor function of the aggregate pressures on shared resources from all cores, regardless of which applications are co-running and what their individual pressures are. Second, a predictor function is piecewise rather than non-piecewise as in prior work, thereby enabling different types of dominant contention factors to be more accurately captured by different subfunctions in its different subdomains. Based on these insights, we propose to adopt a two-phase regression approach to efficiently building a predictor function. Validation using a large number of benchmarks and nine real-world datacenter applications on three different platforms shows that our approach is also precise, with an average error not exceeding 0.4%. When applied to the nine datacenter applications, our approach improves overall resource utilization from 50% to 88% at the cost of 10% QoS degradation.