Efficient stream compaction on wide SIMD many-core architectures

Authors:
Markus Billeter;Ola Olsson;Ulf Assarsson
Affiliations:
Chalmers University of Technology;Chalmers University of Technology;Chalmers University of Technology
Venue:
Proceedings of the Conference on High Performance Graphics 2009
Year:
2009

Abstract

Stream compaction is a common parallel primitive used to remove unwanted elements from sparse data. This allows highly parallel algorithms to maintain performance over several processing steps and reduces overall memory usage. For wide SIMD many-core architectures, we present a novel stream compaction algorithm and explore several variations thereof. Our algorithm is designed to maximize concurrent execution, with minimal use of synchronization. Bandwidth and auxiliary storage requirements are reduced significantly, which allows for substantially better performance. We have tested our algorithms using CUDA on a PC with an NVIDIA GeForce GTX280 GPU. On this hardware, our reference implementation provides a 3x speedup over previously published algorithms.
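
For readers unfamiliar with the primitive, the following is a minimal sequential sketch of what stream compaction computes. It uses the generic textbook scan-and-scatter formulation (flag, exclusive prefix sum, scatter) and is not the algorithm presented in this paper. The function name `compact` and the predicate parameter `keep` are hypothetical names chosen for illustration.

```python
from itertools import accumulate
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def compact(items: Sequence[T], keep: Callable[[T], bool]) -> List[T]:
    """Sequential reference sketch of scan-and-scatter stream compaction.

    Illustrative only: this is the generic formulation, not the paper's
    algorithm, and it makes no attempt at parallel execution.
    """
    # Step 1: flag each element (1 = keep, 0 = discard).
    flags = [1 if keep(x) else 0 for x in items]

    # Step 2: an exclusive prefix sum over the flags gives every kept
    # element its position in the compacted output.
    inclusive = list(accumulate(flags))
    offsets = [total - flag for total, flag in zip(inclusive, flags)]

    # Step 3: scatter the kept elements to their output positions.
    # The input order of the surviving elements is preserved.
    count = inclusive[-1] if inclusive else 0
    out: List[T] = [None] * count  # placeholder slots, all overwritten below
    for x, flag, dst in zip(items, flags, offsets):
        if flag:
            out[dst] = x
    return out
```

For example, `compact([3, 0, 0, 7, 0, 5], lambda v: v != 0)` returns `[3, 7, 5]`. On a GPU, each of the three steps would typically run as a data-parallel pass, and the prefix sum is the step that requires coordination between threads.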