The cache complexity of multithreaded cache oblivious algorithms

Authors:
Matteo Frigo;Volker Strumpen
Affiliations:
IBM Austin Research Laboratory, Austin, TX;IBM Austin Research Laboratory, Austin, TX
Venue:
Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
Year:
2006


Abstract

We present a technique for analyzing the number of cache misses incurred by multithreaded cache oblivious algorithms on an idealized parallel machine in which each processor has a private cache. We specialize this technique to computations executed by the Cilk work-stealing scheduler on a machine with dag-consistent shared memory. We show that a multithreaded cache oblivious matrix multiplication incurs $O(n^3/\sqrt{Z} + (Pn)^{1/3} n^2)$ cache misses when executed by the Cilk scheduler on a machine with $P$ processors, each with a cache of size $Z$, with high probability. This bound is tighter than previously published bounds. We also present a new multithreaded cache oblivious algorithm for 1D stencil computations, which incurs $O(n^2/Z + n + \sqrt{P n^{3+\varepsilon}})$ cache misses with high probability.
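
To make the matrix-multiplication setting concrete, the following is a minimal sketch of the standard recursive divide-and-conquer scheme usually meant by cache oblivious matrix multiplication. It is an illustration under assumptions, not the paper's code: the names `co_matmul` and `multiply` and the `base` cutoff are hypothetical, the code runs sequentially in Python, and the comments only mark where a Cilk-style program could spawn the two halves in parallel.

```python
def co_matmul(A, B, C, i0, i1, j0, j1, k0, k1, base=4):
    """Accumulate A[i0:i1, k0:k1] * B[k0:k1, j0:j1] into C[i0:i1, j0:j1].

    The recursion halves the largest of the three dimensions and never
    refers to the cache size Z, which is what makes it cache oblivious.
    """
    di, dj, dk = i1 - i0, j1 - j0, k1 - k0
    if max(di, dj, dk) <= base:
        # Base case: plain triple loop on a small block.
        for i in range(i0, i1):
            for j in range(j0, j1):
                s = C[i][j]
                for k in range(k0, k1):
                    s += A[i][k] * B[k][j]
                C[i][j] = s
        return
    if di >= dj and di >= dk:
        # Split rows of A and C: the halves write disjoint parts of C,
        # so a multithreaded version could spawn them in parallel.
        im = i0 + di // 2
        co_matmul(A, B, C, i0, im, j0, j1, k0, k1, base)
        co_matmul(A, B, C, im, i1, j0, j1, k0, k1, base)
    elif dj >= dk:
        # Split columns of B and C: again independent halves.
        jm = j0 + dj // 2
        co_matmul(A, B, C, i0, i1, j0, jm, k0, k1, base)
        co_matmul(A, B, C, i0, i1, jm, j1, k0, k1, base)
    else:
        # Split the inner dimension: both halves update the same block
        # of C, so they run one after the other.
        km = k0 + dk // 2
        co_matmul(A, B, C, i0, i1, j0, j1, k0, km, base)
        co_matmul(A, B, C, i0, i1, j0, j1, km, k1, base)


def multiply(A, B, base=4):
    """Return the product of list-of-lists matrices A (n x m) and B (m x p)."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    co_matmul(A, B, C, 0, n, 0, p, 0, m, base)
    return C
```

For example, `multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`. Running the inner-dimension halves sequentially is one simple way to avoid concurrent updates to the same block of `C`; other designs use temporary storage instead.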