The impact of memory subsystem resource sharing on datacenter applications

Authors:
Lingjia Tang; Jason Mars; Neil Vachharajani; Robert Hundt; Mary Lou Soffa
Affiliations:
University of Virginia, Charlottesville, VA, USA; University of Virginia, Charlottesville, VA, USA; Pure Storage, Mountain View, CA, USA; Google, Mountain View, CA, USA; University of Virginia, Charlottesville, VA, USA
Venue:
Proceedings of the 38th annual international symposium on Computer architecture
Year:
2011


Abstract

In this paper we study the impact of sharing memory resources on five Google datacenter applications: a web search engine, Bigtable, content analyzer, image stitching, and protocol buffer. While prior work has found neither positive nor negative effects from cache sharing across the PARSEC benchmark suite, we find that across these datacenter applications, there is both a sizable benefit and a potential degradation from improperly sharing resources. There are four main contributions of this paper. First, we present a study of the importance of thread-to-core mapping for applications in the datacenter, as threads can be mapped to share or to not share caches and bus bandwidth. Second, we investigate the impact of co-locating threads from multiple applications with diverse memory behavior and discover that the best mapping for a given application changes depending on its co-runner. Third, we investigate the application characteristics that impact performance in the various thread-to-core mapping scenarios. Finally, we present both a heuristics-based and an adaptive approach to arrive at good thread-to-core decisions in the datacenter. We observe performance swings of up to 25% for web search, and 40% for other key applications, simply based on how application threads are mapped to cores. By employing our adaptive thread-to-core mapper, the performance of the datacenter applications presented in this work improved by up to 22% over status quo thread-to-core mapping and came within 3% of optimal.
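
The adaptive idea mentioned in the abstract can be pictured with a minimal sketch. It is not the authors' implementation. It assumes a mapper that tries each candidate thread-to-core mapping for a short sampling interval and keeps the one with the best measured performance. The core numbering, the three candidate mappings, the function name `pick_best_mapping`, and the `measure` callback are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Tuple

# A mapping lists the core id assigned to each thread, in thread order.
Mapping = Tuple[int, ...]

# Hypothetical topology (an assumption, not taken from the paper):
# cores 0 and 1 share a cache, cores 0 and 2 share only the bus,
# and cores 0 and 4 sit on different sockets and share neither.
CANDIDATE_MAPPINGS: Dict[str, Mapping] = {
    "share_cache_and_bus": (0, 1),
    "share_bus_only": (0, 2),
    "share_nothing": (0, 4),
}


def pick_best_mapping(
    candidates: Dict[str, Mapping],
    measure: Callable[[Mapping], float],
) -> str:
    """Return the name of the candidate with the highest measured score.

    `measure` stands in for running the application under a mapping for a
    short interval and returning a higher-is-better performance number.
    """
    if not candidates:
        raise ValueError("at least one candidate mapping is required")
    # Sample every candidate once, then keep the best performer.
    scores = {name: measure(cores) for name, cores in candidates.items()}
    return max(scores, key=scores.get)


# Usage with made-up scores standing in for real measurements.
fake_scores = {(0, 1): 0.8, (0, 2): 1.0, (0, 4): 0.9}
best = pick_best_mapping(CANDIDATE_MAPPINGS, lambda cores: fake_scores[cores])
print(best)
```

In a real system, `measure` would pin the threads (for example with `os.sched_setaffinity` on Linux) and read a throughput or hardware-counter metric. Because the abstract notes that the best mapping depends on the co-runner, such sampling would need to be repeated whenever the co-located workload changes.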