Fast scan algorithms on graphics processors

Authors:
Yuri Dotsenko;Naga K. Govindaraju;Peter-Pike Sloan;Charles Boyd;John Manferdelli
Affiliations:
Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA
Venue:
Proceedings of the 22nd annual international conference on Supercomputing
Year:
2008

Abstract

Scan and segmented scan are important data-parallel primitives for a wide range of applications. We present fast, work-efficient algorithms for these primitives on graphics processing units (GPUs). We use novel data representations that map well to the GPU architecture. Our algorithms exploit shared memory to improve memory performance. We further improve the performance of our algorithms by eliminating shared-memory bank conflicts and reducing the overheads in prior shared-memory GPU algorithms. Furthermore, our algorithms are designed to work well on general data sets, including segmented arrays with arbitrary segment lengths. We also present optimizations to improve the performance of segmented scans based on the segment lengths. We implemented our algorithms on a PC with an NVIDIA GeForce 8800 GPU and compared our results with prior GPU-based algorithms. Our results indicate up to 10x higher performance over prior algorithms on input sequences with millions of elements.
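
For readers unfamiliar with the two primitives named in the abstract, the following is a minimal sequential sketch of what an inclusive scan and a segmented inclusive scan compute. It is not the authors' GPU algorithm and says nothing about shared memory or bank conflicts; it only fixes the input/output semantics. The function names `inclusive_scan` and `segmented_inclusive_scan`, and the use of a head-flags array to mark segment starts, are assumptions made for illustration.

```python
from itertools import accumulate
from operator import add


def inclusive_scan(values, op=add):
    """Return [v0, v0 op v1, v0 op v1 op v2, ...] for an associative op."""
    return list(accumulate(values, op))


def segmented_inclusive_scan(values, head_flags, op=add):
    """Inclusive scan that restarts at every position whose head flag is set.

    head_flags[i] == 1 marks the first element of a segment (an assumed
    representation; segments may have arbitrary lengths).
    """
    if len(values) != len(head_flags):
        raise ValueError("values and head_flags must have the same length")
    out = []
    for value, is_head in zip(values, head_flags):
        if is_head or not out:
            # Start of a new segment: the running result resets.
            out.append(value)
        else:
            # Continue the current segment from the previous partial result.
            out.append(op(out[-1], value))
    return out


data = [3, 1, 7, 0, 4, 1, 6, 3]
flags = [1, 0, 0, 1, 0, 1, 0, 0]  # segments: [3, 1, 7], [0, 4], [1, 6, 3]

print(inclusive_scan(data))                   # [3, 4, 11, 11, 15, 16, 22, 25]
print(segmented_inclusive_scan(data, flags))  # [3, 4, 11, 0, 4, 1, 7, 10]
```

A parallel implementation has to produce the same outputs as this reference while distributing the work across many threads, which is the problem the abstract's algorithms address.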