Scheduling irregular parallel computations on hierarchical caches

Authors:
Guy E. Blelloch;Jeremy T. Fineman;Phillip B. Gibbons;Harsha Vardhan Simhadri
Affiliations:
Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA;Intel Labs Pittsburgh, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA
Venue:
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Year:
2011

Citing 17
Cited 8

Fat-trees: universal networks for hardware-efficient supercomputing

IEEE Transactions on Computers
A bridging model for parallel computation

Communications of the ACM
LogP: towards a realistic model of parallel computation

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Introduction to Algorithms

Introduction to Algorithms
Dag-Consistent Distributed Shared Memory

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Submachine Locality in the Bulk Synchronous Setting (Extended Abstract)

Euro-Par '96 Proceedings of the Second International Euro-Par Conference on Parallel Processing-Volume II
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Effectively sharing a cache among threads

Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
Concurrent cache-oblivious b-trees

Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
The cache complexity of multithreaded cache oblivious algorithms

Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Provably good multicore cache performance for divide-and-conquer algorithms

Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Fundamental parallel algorithms for private-cache chip multiprocessors

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Cache-efficient dynamic programming algorithms for multicores

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
A Bridging Model for Multi-core Computing

ESA '08 Proceedings of the 16th annual European symposium on Algorithms
Low depth cache-oblivious algorithms

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Resource oblivious sorting on multicores

ICALP'10 Proceedings of the 37th international colloquium conference on Automata, languages and programming

CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures

Proceedings of the 26th ACM international conference on Supercomputing
Parallel and I/O efficient set covering algorithms

Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Betweenness centrality: algorithms and implementations

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Design and implementation of a customizable work stealing scheduler

Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
Program-centric cost models for locality

Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Parallel triangle counting in massive streaming graphs

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
A memory access model for highly-threaded many-core architectures

Future Generation Computer Systems
Measurement of the latency parameters of the Multi-BSP model: a multicore benchmarking approach

The Journal of Supercomputing

Quantified Score

Hi-index	0.00

Visualization

Abstract

For nested-parallel computations with low depth (span, critical path length) analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds. The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses are evenly balanced across the processors---i.e., the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.