List ranking and list scan on the Cray C-90

Authors:
Margaret Reid-Miller
Affiliations:
Carnegie Mellon Univ., Pittsburgh, PA
Venue:
SPAA '94 Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures
Year:
1994

Citing 17
Cited 10

Advanced compiler optimizations for supercomputers

Communications of the ACM - Special issue on parallelism
Deterministic coin tossing with applications to optimal parallel list ranking

Information and Control
Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms

STOC '86 Proceedings of the eighteenth annual ACM symposium on Theory of computing
A simple parallel algorithm for the maximal independent set problem

STOC '85 Proceedings of the seventeenth annual ACM symposium on Theory of computing
Deterministic parallel list ranking

VLSI Algorithms and Architectures
Optimal parallel evaluation of tree-structured computations by raking (extended abstract)

VLSI Algorithms and Architectures
A simple parallel tree contraction algorithm

Journal of Algorithms
Faster optimal parallel prefix sums and list ranking

Information and Computation
Computer architecture: a quantitative approach

Computer architecture: a quantitative approach
A bridging model for parallel computation

Communications of the ACM
A simple randomized parallel algorithm for list-ranking

Information Processing Letters
Scan primitives for vector computers

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Parallel tree contraction part 2: further applications

SIAM Journal on Computing
Radix sort for vector multiprocessors

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Vector performance estimation for CRAY X-MP/Y-MP supercomputers

The Journal of Supercomputing
Solving Linear Recurrences with Loop Raking

IPPS '92 Proceedings of the 6th International Parallel Processing Symposium
List Ranking and List Scan on the CRAY C-90

List Ranking and List Scan on the CRAY C-90

Accounting for memory bank contention and delay in high-bandwidth multiprocessors

Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Better trade-offs for parallel list ranking

Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Scalable Parallel Implementations of List Ranking on Fine-Grained Machines

IEEE Transactions on Parallel and Distributed Systems
Experiments with list ranking for explicit multi-threaded (XMT) instruction parallelism

Journal of Experimental Algorithmics (JEA)
Experiments with List Ranking for Explicit Multi-Threaded (XMT) Instruction Parallelism

WAE '99 Proceedings of the 3rd International Workshop on Algorithm Engineering
Portable List Ranking: An Experimental Study

WAE '00 Proceedings of the 4th International Workshop on Algorithm Engineering
Using PRAM Algorithms on a Uniform-Memory-Access Shared-Memory Architecture

WAE '01 Proceedings of the 5th International Workshop on Algorithm Engineering
Handling Graphs According to a Coarse Grained Approach: Experiments with PVM and MPI

Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Portable list ranking: an experimental study

Journal of Experimental Algorithmics (JEA)
Fast and scalable list ranking on the GPU

Proceedings of the 23rd international conference on Supercomputing

Abstract

List ranking and list scan are two primitive operations used in many parallel algorithms that use lists, trees, and graph data structures. But vectorizing and parallelizing list ranking is a challenge because it is highly communication intensive and dynamic. In addition, the serial algorithm is very simple and has very small constants. In order to compete, a parallel algorithm must also be simple and have small constants. A parallel algorithm due to Wyllie is such an algorithm, but it is not work efficient: its performance degrades for longer and longer linked lists. In contrast, work efficient PRAM algorithms developed to date have very large constants. Our algorithm does not achieve O(log n) running time, but we contend that work efficiency and small constants are more important, given that vector and multiprocessor machines are used for problems that are much larger than the number of processors and, therefore, the O(log n) time is never achieved in practice. In particular, to the best of our knowledge, our implementation of list ranking and list scan on the CRAY C-90 is the fastest implementation to date. In addition, it is the first implementation of which we are aware that outperforms fast workstations. The success of our algorithm is due to its relatively large grain size and simplicity of the inner loops, and the success of the implementation is due to pipelining reads and writes through vectorization to hide latency, minimizing load balancing by deriving equations for predicting and optimizing performance, and avoiding conditional tests except when load balancing.
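
To make the contrast in the abstract concrete, the following is a minimal Python sketch of the two baseline approaches it mentions: the simple serial traversal and Wyllie's pointer-jumping algorithm. It is an illustration only, not the paper's C-90 implementation. The successor-array representation (`nxt[i]` is the node after `i`, with `-1` marking the tail) and the function names `serial_rank` and `wyllie_rank` are assumptions made for this example. Rank here means the number of links from a node to the tail.

```python
def serial_rank(nxt, head):
    """Serial list ranking: walk the list once, then assign ranks.

    nxt[i] is the successor of node i, or -1 at the tail (assumed layout).
    Does O(n) work with a very small constant per node.
    """
    order = []
    i = head
    while i != -1:
        order.append(i)
        i = nxt[i]
    rank = [0] * len(nxt)
    n = len(order)
    for pos, node in enumerate(order):
        rank[node] = n - 1 - pos  # links remaining until the tail
    return rank


def wyllie_rank(nxt):
    """Wyllie-style pointer jumping, simulated sequentially.

    Each round, every node adds its successor's rank to its own and then
    jumps over that successor. The inner loop is what a PRAM or vector
    machine would execute in parallel; the double buffering below stands
    in for the synchronous read-then-write of a parallel step.
    About log2(n) rounds of O(n) work each give O(n log n) total work,
    which is why the algorithm is not work efficient.
    """
    n = len(nxt)
    succ = list(nxt)
    rank = [0 if s == -1 else 1 for s in succ]
    while any(s != -1 for s in succ):
        new_rank = rank[:]
        new_succ = succ[:]
        for i in range(n):  # conceptually one parallel step
            s = succ[i]
            if s != -1:
                new_rank[i] = rank[i] + rank[s]
                new_succ[i] = succ[s]
        rank, succ = new_rank, new_succ
    return rank


# Example list: 3 -> 0 -> 2 -> 1 (tail)
example = [2, -1, 1, 0]
print(serial_rank(example, head=3))
print(wyllie_rank(example))
```

Both functions return the same ranks for the same list. The serial version touches each node a constant number of times, while the pointer-jumping version revisits every node in every round, which illustrates why its cost grows faster than linearly as lists get longer.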