Effectively sharing a cache among threads

Authors:
Guy E. Blelloch;Phillip B. Gibbons
Affiliations:
Carnegie Mellon University;Intel Research Pittsburgh
Venue:
Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
Year:
2004

Citing 24
Cited 20

Amortized efficiency of list update and paging rules

Communications of the ACM
Footprints in the cache

ACM Transactions on Computer Systems (TOCS)
An analytical cache model

ACM Transactions on Computer Systems (TOCS)
Analysis of multithreaded architectures for parallel computing

SPAA '90 Proceedings of the second annual ACM symposium on Parallel algorithms and architectures
Improving Disk Cache Hit-Ratios Through Cache Partitioning

IEEE Transactions on Computers
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
An analysis of dag-consistent distributed shared-memory algorithms

Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures
Space-efficient scheduling of parallelism with synchronization variables

Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Provably efficient scheduling for languages with fine-grained parallelism

Journal of the ACM (JACM)
Space-efficient scheduling of nested parallelism

ACM Transactions on Programming Languages and Systems (TOPLAS)
Scheduling multithreaded computations by work stealing

Journal of the ACM (JACM)
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Piranha: a scalable architecture based on single-chip multiprocessing

Proceedings of the 27th annual international symposium on Computer architecture
Application-Controlled Paging for a Shared Cache

SIAM Journal on Computing
Analytical cache models with applications to cache partitioning

ICS '01 Proceedings of the 15th international conference on Supercomputing
Towards a first vertical prototyping of an extremely fine-grained parallel programming approach

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Low-contention depth-first scheduling of parallel computations with write-once synchronization variables

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Cache-oblivious priority queue and graph algorithm applications

STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
A Single-Chip Multiprocessor

Computer
The Stanford Hydra CMP

IEEE Micro
The MAJC Architecture: A Synthesis of Parallelism and Scalability

IEEE Micro
Competitive Analysis of Paging

Developments from a June 1996 seminar on Online algorithms: the state of the art
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Dynamic Partitioning of Shared Cache Memory

The Journal of Supercomputing

Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
The cache complexity of multithreaded cache oblivious algorithms

Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
Parallel depth first vs. work stealing schedulers on CMP architectures

Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Scheduling threads for constructive cache sharing on CMPs

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Provably good multicore cache performance for divide-and-conquer algorithms

Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Cache-efficient dynamic programming algorithms for multicores

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Exploiting multithreaded architectures to improve the hash join operation

Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
Brief announcement: low depth cache-oblivious sorting

Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
Compositional, Dynamic Cache Management for Embedded Chip Multiprocessors

Journal of Signal Processing Systems
Contention aware execution: online contention detection and response

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Low depth cache-oblivious algorithms

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Hierarchical Diagonal Blocking and Precision Reduction Applied to Combinatorial Multigrid

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Brief announcement: paging for multicore processors

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Scheduling irregular parallel computations on hierarchical caches

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
The impact of memory subsystem resource sharing on datacenter applications

Proceedings of the 38th annual international symposium on Computer architecture
Loaf: a framework and infrastructure for creating online adaptive solutions

Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Paging for multi-core shared caches

Proceedings of the 3rd Innovations in Theoretical Computer Science Conference
Cache-efficient parallel isosurface extraction for shared cache multicores

EG PGV'10 Proceedings of the 10th Eurographics conference on Parallel Graphics and Visualization
Program-centric cost models for locality

Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness

Quantified Score

Hi-index	0.00

Visualization

Abstract

We compare the number of cache misses M1 for running a computation on a single processor with cache size C1 to the total number of misses Mp for the same computation when using p processors or threads and a shared cache of size Cp. We show that for any computation, and with an appropriate (greedy) parallel schedule, if Cp ≥ C1 + pd then Mp ≤ M1. The depth d of the computation is the length of the critical path of dependences. This gives the perhaps surprising result that for sufficiently parallel computations the shared cache need only be an additive size larger than the single-processor cache, and gives some theoretical justification for designing machines with shared caches.We model a computation as a DAG and the sequential execution as a depth first schedule of the DAG. The parallel schedule we study is a parallel depth-first schedule (PDF schedule) based on the sequential one. The schedule is greedy and therefore work-efficient. Our main results assume the Ideal Cache model, but we also present results for other more realistic cache models.