ReSense: Mapping dynamic workloads of colocated multithreaded applications using resource sensitivity

Authors:
Tanima Dey;Wei Wang;Jack W. Davidson;Mary Lou Soffa
Affiliations:
University of Virginia, Charlottesville, Virginia;University of Virginia, Charlottesville, Virginia;University of Virginia, Charlottesville, Virginia;University of Virginia, Charlottesville, Virginia
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2013

Abstract

To utilize the full potential of modern chip multiprocessors and obtain scalable performance improvements, it is critical to mitigate resource contention created by multithreaded workloads. In this article, we describe ReSense, the first runtime system that uses application characteristics to dynamically map multithreaded applications from dynamic workloads—workloads where multithreaded applications arrive, execute, and terminate continuously in unpredictable ways. ReSense mitigates contention for the shared resources in the memory hierarchy by applying a novel thread-mapping algorithm that dynamically adjusts the mapping of threads from dynamic workloads using a precalculated sensitivity score. The sensitivity score quantifies an application's sensitivity to sharing a particular memory resource and is calculated by an efficient characterization process that involves running the multithreaded application by itself on the target platform. To measure ReSense's effectiveness, sensitivity scores were determined for 21 benchmarks from PARSEC-2.1 and NPB-OMP-3.3 for the shared resources in the memory hierarchy on four different platforms. Using three different-sized dynamic workloads composed of randomly selected two, four, and eight corunning benchmarks with randomly selected start times, ReSense was able to improve the average response time of the three workloads by up to 27.03%, 20.89%, and 29.34% and throughput by up to 19.97%, 46.56%, and 29.86%, respectively, over the native OS on real hardware. By estimating and comparing ReSense's effectiveness with the optimal thread mapping for two different workloads, we found that the maximum average difference with the experimentally determined optimal performance was 1.49% for average response time and 2.08% for throughput.
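The idea of steering thread placement by a per-application sensitivity score can be illustrated with a small sketch. It is not ReSense's algorithm, which the abstract does not specify. It is a generic greedy balancer, assuming each application has a single precalculated score for one shared resource and that cores are grouped into domains sharing that resource. The class name `SensitivityMapper`, the domain names, the score values, and the rebalance-on-termination policy are all hypothetical, and core capacity per domain is ignored for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """A group of cores sharing one memory resource (hypothetical model)."""
    name: str
    apps: dict = field(default_factory=dict)  # app name -> sensitivity score

    def load(self) -> float:
        # Total sensitivity of the applications currently mapped here.
        return sum(self.apps.values())


class SensitivityMapper:
    """Greedy placement: keep the summed sensitivity per domain balanced."""

    def __init__(self, domain_names, scores):
        self.domains = [Domain(n) for n in domain_names]
        self.scores = scores  # assumed to be precalculated offline

    def _least_loaded(self) -> Domain:
        # Ties are broken by application count, then by domain order.
        return min(self.domains, key=lambda d: (d.load(), len(d.apps)))

    def arrive(self, app: str) -> str:
        """Map a newly arrived application and return its domain name."""
        target = self._least_loaded()
        target.apps[app] = self.scores[app]
        return target.name

    def terminate(self, app: str) -> None:
        """Remove a finished application, then rebalance the remainder."""
        for d in self.domains:
            d.apps.pop(app, None)
        self.rebalance()

    def rebalance(self) -> None:
        # Re-place all running applications, most sensitive first.
        running = [a for d in self.domains for a in d.apps]
        for d in self.domains:
            d.apps.clear()
        for app in sorted(running, key=lambda a: (-self.scores[a], a)):
            self.arrive(app)

    def placement(self) -> dict:
        return {d.name: sorted(d.apps) for d in self.domains}


if __name__ == "__main__":
    # Illustrative scores only; they are not measurements from the paper.
    mapper = SensitivityMapper(["llc0", "llc1"],
                               {"A": 0.9, "B": 0.1, "C": 0.7, "D": 0.3})
    for name in ["A", "C", "B", "D"]:
        print(name, "->", mapper.arrive(name))
    mapper.terminate("A")
    print(mapper.placement())
```

On a real system, the resulting placement would still have to be enforced through an OS affinity mechanism. That step is left out here.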