Low depth cache-oblivious algorithms

Authors:
Guy E. Blelloch;Phillip B. Gibbons;Harsha Vardhan Simhadri
Affiliations:
Carnegie Mellon University, Pittsburgh, USA;Intel Research Pittsburgh, Pittsburgh, USA;Carnegie Mellon University, Pittsburgh, USA
Venue:
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Year:
2010

Abstract

In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
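
The role of depth in the "known mappings" mentioned above can be made concrete with the bounds usually quoted for those schedulers. What follows is a sketch based on the standard statements of those results, not text taken from this abstract, and the notation is an assumption of the sketch: $Q_1(n; M, B)$ is the sequential cache complexity on input size $n$ with cache size $M$ and block size $B$, $Q_p$ is the total number of cache misses on $p$ processors, and $D$ is the depth of the computation. Exact constants and probability terms should be checked against the original scheduler analyses.

```latex
% Private caches (one per processor), work-stealing scheduler;
% the bound holds with high probability:
Q_p(n; M, B) \;<\; Q_1(n; M, B) + O\!\left(\frac{p\,M\,D}{B}\right)

% One shared cache, parallel-depth-first (PDF) scheduler;
% the shared cache is enlarged by an additive p B D:
Q_p(n; M + p\,B\,D,\; B) \;\le\; Q_1(n; M, B)
```

In both bounds the depth $D$ appears only in the overhead term (extra misses in the first case, extra cache space in the second), so an algorithm with polylogarithmic depth keeps that overhead small relative to its sequential cache complexity.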