List ranking and list scan are two primitive operations used in many parallel algorithms that operate on list, tree, and graph data structures. Vectorizing and parallelizing list ranking is a challenge, however, because it is highly communication intensive and dynamic. In addition, the serial algorithm is very simple and has very small constants. In order to compete, a parallel algorithm must also be simple and have small constants. A parallel algorithm due to Wyllie is such an algorithm, but it is not work efficient; its performance degrades for longer and longer linked lists. In contrast, work-efficient PRAM algorithms developed to date have very large constants. The algorithm presented here does not achieve O(log n) running time, but we contend that work efficiency and small constants are more important, given that vector and multiprocessor machines are used for problems that are much larger than the number of processors and, therefore, the O(log n) time is never achieved in practice. In particular, to the best of our knowledge, our implementation of list ranking and list scan on the CRAY C-90 is the fastest implementation to date. In addition, it is the first implementation of which we are aware that outperforms fast workstations. The success of our algorithm is due to its relatively large grain size and the simplicity of its inner loops. The success of the implementation is due to pipelining reads and writes through vectorization to hide latency, minimizing load balancing by deriving equations for predicting and optimizing performance, and avoiding conditional tests except when load balancing.
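To make the comparison between the two baselines concrete, the following is a minimal Python sketch of serial list ranking and of Wyllie's pointer-jumping scheme. It is not the algorithm or the CRAY C-90 implementation described above. The conventions are assumptions made for illustration: the list is stored as a successor array `succ`, the tail points to itself, and a node's rank is the number of links from it to the tail. The function names `serial_rank` and `wyllie_rank` are hypothetical, and the parallel rounds are simulated sequentially.

```python
def serial_rank(succ, head):
    """Serial list ranking: O(n) work, a single walk along the list.

    Assumed convention: succ[i] is the successor of node i, the tail
    satisfies succ[tail] == tail, and rank = links from node to tail.
    """
    order = []
    node = head
    while True:
        order.append(node)
        if succ[node] == node:  # reached the tail
            break
        node = succ[node]
    rank = [0] * len(succ)
    for dist, node in enumerate(reversed(order)):
        rank[node] = dist
    return rank


def wyllie_rank(succ):
    """Pointer jumping in the style of Wyllie's algorithm.

    Each round updates every node "in parallel" (simulated here with
    list comprehensions over the old arrays), so each round costs n
    operations. About log2(n) rounds are needed, giving O(n log n)
    total work, which is why this approach is not work efficient.
    """
    n = len(succ)
    nxt = list(succ)
    # The tail starts at rank 0; every other node is one link from its successor.
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # n.bit_length() >= ceil(log2(n)); extra rounds are harmless because
    # the tail points to itself and contributes rank 0.
    for _ in range(n.bit_length()):
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank


# Example list stored out of order: 2 -> 0 -> 3 -> 1 (node 1 is the tail).
succ = [3, 1, 0, 1]
print(serial_rank(succ, head=2))
print(wyllie_rank(succ))
```

Both functions return the same ranks. The difference is in the work performed: the serial walk touches each node a constant number of times, while pointer jumping touches all n nodes in every one of its roughly log n rounds.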