CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures

Authors:
Quan Chen; Minyi Guo; Zhiyi Huang
Affiliations:
Shanghai Jiao Tong University, Shanghai, China; Department of Computer Science and Engineering, Shanghai, China; University of Otago, Dunedin, New Zealand
Venue:
Proceedings of the 26th ACM international conference on Supercomputing
Year:
2012

Abstract

Multi-socket multi-core architectures with a shared cache in each socket have become mainstream where a single multi-core chip cannot provide enough computing capacity for high-performance computing. However, traditional task-stealing schedulers tend to pollute the shared cache and incur severe cache misses because of the randomness of their stealing. To address this problem, this paper proposes a Cache Aware Task-Stealing (CATS) scheduler, which uses an online profiling method to exploit the shared cache efficiently and schedules tasks that share data onto the same socket. CATS adopts an online DAG partitioner, driven by the profiling information, to ensure that tasks with shared data can use the shared cache efficiently. A notable feature of CATS is that it does not require any extra user-provided information. Experimental results show that CATS can improve the performance of memory-bound programs by up to 74.4% compared with the traditional task-stealing scheduler.
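
To make the contrast with random stealing concrete, the following is a minimal sketch of socket-aware victim selection in a task-stealing scheduler. It is not the CATS algorithm: the abstract gives no details of the profiler or the DAG partitioner, so this only illustrates the general idea of keeping stolen tasks within a socket where possible. The names `Worker`, `pick_victim`, and `steal` are hypothetical and chosen for illustration.

```python
import random
from collections import deque


class Worker:
    """A worker thread pinned to one core of one socket (illustrative model)."""

    def __init__(self, worker_id, socket_id):
        self.worker_id = worker_id
        self.socket_id = socket_id
        self.tasks = deque()  # owner works at the right end, thieves take from the left


def pick_victim(thief, workers, rng=random):
    """Choose a victim with pending tasks, preferring the thief's own socket.

    A traditional scheduler would pick uniformly from all candidates; here,
    workers on other sockets are used only when no same-socket worker has tasks.
    """
    candidates = [w for w in workers if w is not thief and w.tasks]
    local = [w for w in candidates if w.socket_id == thief.socket_id]
    pool = local or candidates
    return rng.choice(pool) if pool else None


def steal(thief, workers, rng=random):
    """Move one task from the chosen victim to the thief; return it, or None."""
    victim = pick_victim(thief, workers, rng)
    if victim is None:
        return None
    task = victim.tasks.popleft()
    thief.tasks.append(task)
    return task
```

In this sketch a thief crosses a socket boundary only when its own socket has run dry, which is one simple way to limit the shared-cache pollution the abstract attributes to random stealing.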