Cilk: an efficient multithreaded runtime system
Journal of Parallel and Distributed Computing - Special issue on multithreading for multiprocessors
Programming with POSIX threads
Programming with POSIX threads
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Using MPI (2nd ed.): portable parallel programming with the message-passing interface
Using MPI (2nd ed.): portable parallel programming with the message-passing interface
Online performance analysis by statistical sampling of microprocessor performance counters
Proceedings of the 19th annual international conference on Supercomputing
Scheduling threads for constructive cache sharing on CMPs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Provably good multicore cache performance for divide-and-conquer algorithms
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Maotai: View-Oriented Parallel Programming on CMT Processors
ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing
Intel threading building blocks
Intel threading building blocks
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
IEEE Transactions on Parallel and Distributed Systems
Less reused filter: improving l2 cache performance via filtering less reused lines
Proceedings of the 23rd international conference on Supercomputing
Work-first and help-first scheduling policies for async-finish task parallelism
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
The Cilk++ concurrency platform
Proceedings of the 46th Annual Design Automation Conference
Featherweight X10: a core calculus for async-finish parallelism
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
ULCC: a user-level facility for optimizing shared cache performance on multicores
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Scheduling task parallelism on multi-socket multicore systems
Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers
Scheduling irregular parallel computations on hierarchical caches
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Controlling cache utilization of HPC applications
Proceedings of the international conference on Supercomputing
CAB: Cache Aware Bi-tier Task-Stealing in Multi-socket Multi-core Architecture
ICPP '11 Proceedings of the 2011 International Conference on Parallel Processing
WATS: Workload-Aware Task Scheduling in Asymmetric Multi-core Architectures
IPDPS '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium
Palirria: Accurate On-line Parallelism Estimation for Adaptive Work-Stealing
Proceedings of Programming Models and Applications on Multicores and Manycores
DWS: Demand-aware Work-Stealing in Multi-programmed Multi-core Architectures
Proceedings of Programming Models and Applications on Multicores and Manycores
Adaptive workload-aware task scheduling for single-ISA asymmetric multicore architectures
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Multi-socket Multi-core architectures with shared caches in each socket have become mainstream when a single multi-core chip cannot provide enough computing capacity for high performance computing. However, traditional task-stealing schedulers tend to pollute the shared cache and incur severe cache misses due to their randomness in stealing. To address the problem, this paper proposes a Cache Aware Task-Stealing (CATS) scheduler, which uses the shared cache efficiently with an online profiling method and schedules tasks with shared data to the same socket. CATS adopts an online DAG partitioner based on the profiling information to ensure tasks with shared data can efficiently utilize the shared cache. One outstanding novelty of CATS is that it does not require any extra user-provided information. Experimental results show that CATS can improve the performance of memory-bound programs up to 74.4% compared with the traditional task-stealing scheduler.