Scans as Primitive Parallel Operations

Authors:
G. E. Blelloch
Venue:
IEEE Transactions on Computers
Year:
1989

Quantified Score

Hi-index	14.98

Abstract

A study of the effects of adding two scan primitives as unit-time primitives to PRAM (parallel random access machine) models is presented. It is shown that the primitives improve the asymptotic running time of many algorithms by an O(log n) factor, greatly simplify the description of many algorithms, and are significantly easier to implement than memory references. It is argued that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. The author describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radix-sort algorithm, a quicksort algorithm, a minimum-spanning-tree algorithm, a line-drawing algorithm, and a merging algorithm. These all run on an EREW (exclusive read, exclusive write) PRAM with the addition of two scan primitives and are either simpler or more efficient than their pure PRAM counterparts. The scan primitives have been implemented in microcode on the Connection Machine system and are available in PARIS (the parallel instruction set of the machine).
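
To make the idea of building an algorithm from scans concrete, here is a minimal sketch of a scan-based radix sort. The abstract does not name the two primitives or give the algorithms, so everything here is an assumption for illustration: the sketch uses an exclusive prefix sum, the helper names `plus_scan`, `split`, and `radix_sort` are hypothetical, and the scan is simulated with a sequential loop rather than run as a unit-time parallel step.

```python
from typing import List


def plus_scan(values: List[int]) -> List[int]:
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1].

    Simulated sequentially here; in the model described by the abstract
    a scan would count as a single unit-time step.
    """
    out = []
    running = 0
    for v in values:
        out.append(running)
        running += v
    return out


def split(keys: List[int], flags: List[int]) -> List[int]:
    """Stable partition: keys flagged 0 come first, then keys flagged 1.

    Destinations are computed with two scans, so every key knows where
    to go without comparing itself against any other key.
    """
    not_flags = [1 - f for f in flags]
    zero_dest = plus_scan(not_flags)          # slots for the 0-flagged keys
    num_zeros = sum(not_flags)
    one_dest = [num_zeros + d for d in plus_scan(flags)]  # slots after them

    out = [0] * len(keys)
    for i, key in enumerate(keys):
        dest = one_dest[i] if flags[i] else zero_dest[i]
        out[dest] = key                       # each slot is written once
    return out


def radix_sort(keys: List[int], bits: int) -> List[int]:
    """Sort non-negative integers that fit in `bits` bits, one bit per pass."""
    for b in range(bits):                     # least significant bit first
        flags = [(k >> b) & 1 for k in keys]
        keys = split(keys, flags)
    return keys


if __name__ == "__main__":
    print(plus_scan([2, 1, 2, 3, 5, 8, 13, 21]))
    print(radix_sort([5, 7, 3, 1, 4, 2, 7, 2], bits=3))
```

Each pass writes every key to a distinct destination, so no two writes collide. This matches the exclusive-write restriction of the EREW model mentioned in the abstract, though the sketch itself is only a sequential simulation.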