Fundamental parallel algorithms for private-cache chip multiprocessors

Authors:
Lars Arge;Michael T. Goodrich;Michael Nelson;Nodari Sitchinava
Affiliations:
University of Aarhus, Aarhus, Denmark;University of California, Irvine, Irvine, CA, USA;University of California, Irvine, Irvine, CA, USA;University of California, Irvine, Irvine, CA, USA
Venue:
Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Year:
2008


Abstract

In this paper, we study parallel algorithms for private-cache chip multiprocessors (CMPs), focusing on methods for foundational problems that scale with the number of cores. By focusing on private-cache CMPs, we show that efficient algorithms can be designed without additional assumptions about how the cores are interconnected, because we assume that all inter-processor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallel algorithms. In particular, we present two sorting algorithms: a distribution sort and a mergesort. Our algorithms are asymptotically optimal in terms of parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks. In addition, we study sorting lower bounds in a computational model, which we call the parallel external-memory (PEM) model, that formalizes the essential properties of our algorithms for private-cache CMPs.
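
To make the prefix-sums building block mentioned in the abstract concrete, the following is a minimal sequential simulation of a standard two-pass, chunk-per-processor scheme. It is only an illustrative sketch and is not taken from the paper: the function name `blocked_prefix_sums`, its parameters `num_procs` and `block_size`, and the simplified block-read count (which ignores writes and the step that combines chunk totals) are all assumptions made for this example.

```python
from itertools import accumulate
from math import ceil


def blocked_prefix_sums(data, num_procs, block_size):
    """Inclusive prefix sums via a two-pass, chunk-per-processor scheme.

    Hypothetical sketch: processors are simulated sequentially, and the
    returned read count is a rough estimate, not a bound from the paper.
    Returns (prefix_sums, parallel_block_reads).
    """
    n = len(data)
    if n == 0:
        return [], 0

    # Give each simulated processor one contiguous chunk of the input.
    chunk = ceil(n / num_procs)
    chunks = [data[i:i + chunk] for i in range(0, n, chunk)]

    # Pass 1: every processor scans its own chunk and records the total.
    totals = [sum(c) for c in chunks]

    # Combine step: exclusive prefix sums over the chunk totals.
    # Done sequentially here for simplicity.
    offsets = [0] + list(accumulate(totals))[:-1]

    # Pass 2: every processor rescans its chunk, starting from its offset.
    result = []
    for c, offset in zip(chunks, offsets):
        running = offset
        for x in c:
            running += x
            result.append(running)

    # Each of the two passes reads about ceil(chunk / block_size) blocks
    # per processor, and the processors work side by side.
    parallel_block_reads = 2 * ceil(chunk / block_size)
    return result, parallel_block_reads


if __name__ == "__main__":
    sums, reads = blocked_prefix_sums([1, 2, 3, 4, 5, 6, 7, 8],
                                      num_procs=4, block_size=2)
    print(sums, reads)
```

The point of the sketch is that each simulated core touches only its own contiguous chunk, so the estimated number of parallel block reads shrinks as cores are added, and the cores exchange nothing except the per-chunk totals written to shared memory.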