This paper describes an optimized implementation of a set of scan (also called all-prefix-sums) primitives on a single processor of a CRAY Y-MP, and demonstrates that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. The algorithm used to implement the scans is based on an algorithm for parallel computers and is applicable with minor modifications to any register-based vector computer. On the CRAY Y-MP, the asymptotic running time of the plus-scan is about 2.25 times that of a vector add, and is within 20% of optimal. An important aspect of our implementation is that the segmented versions of these scans are only marginally more expensive than the unsegmented versions. These segmented versions can be used to execute a scan on multiple data sets without having to pay the vector startup cost (n_{1/2}) for each set.

The paper describes a radix sorting routine based on the scans that is 13 times faster than a Fortran version and within 20% of a highly optimized library sort routine, three operations on trees that are between 10 and 20 times faster than the corresponding C versions, and a connectionist learning algorithm that is 10 times faster than the corresponding C version for sparse and irregular networks.
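To make the primitives concrete, here is a minimal serial sketch in Python of a plus-scan, a segmented plus-scan, and a radix sort built from scans. It illustrates only what the operations compute, not the vectorized CRAY Y-MP implementation the abstract refers to. The function names, the exclusive-scan convention, the use of boolean flags to mark segment starts, and the bit-at-a-time split formulation of radix sort are all assumptions made for illustration and may differ from the paper's routines.

```python
from typing import List


def plus_scan(values: List[int]) -> List[int]:
    """Exclusive plus-scan (assumed convention): out[i] = sum(values[:i])."""
    out = []
    running = 0
    for v in values:
        out.append(running)
        running += v
    return out


def segmented_plus_scan(values: List[int], flags: List[bool]) -> List[int]:
    """Exclusive plus-scan that restarts wherever flags[i] is True.

    flags[i] marks the first element of a segment (hypothetical encoding).
    All segments are processed in one pass over the data, which is the
    property that lets a vector machine avoid a per-segment startup cost.
    """
    out = []
    running = 0
    for v, is_start in zip(values, flags):
        if is_start:
            running = 0
        out.append(running)
        running += v
    return out


def split_by_bit(keys: List[int], bit: int) -> List[int]:
    """Stable partition: keys whose given bit is 0 first, then those with 1."""
    ones = [(k >> bit) & 1 for k in keys]
    zeros = [1 - b for b in ones]
    zero_pos = plus_scan(zeros)   # destination among the 0-bit keys
    one_pos = plus_scan(ones)     # offset among the 1-bit keys
    n_zeros = sum(zeros)
    out = [0] * len(keys)
    for i, k in enumerate(keys):
        dest = zero_pos[i] if zeros[i] else n_zeros + one_pos[i]
        out[dest] = k
    return out


def radix_sort(keys: List[int], bits: int) -> List[int]:
    """Sort non-negative integers below 2**bits, least significant bit first."""
    for b in range(bits):
        keys = split_by_bit(keys, b)
    return keys
```

For example, `segmented_plus_scan([3, 1, 4, 1, 5], [True, False, True, False, False])` scans the two segments `[3, 1]` and `[4, 1, 5]` independently in a single call, and `radix_sort` relies only on scans plus a permutation step, which is why fast scans translate directly into a fast sort.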