StreamScan: fast scan algorithms for GPUs without global barrier synchronization

Authors:
Shengen Yan;Guoping Long;Yunquan Zhang
Affiliations:
Institute of Software, Chinese Academy of Sciences, Beijing, China;Institute of Software, Chinese Academy of Sciences, Beijing, China;Institute of Software, Chinese Academy of Sciences, Beijing, China
Venue:
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2013

Citing 11
Cited 1

Fast scan algorithms on graphics processors

Proceedings of the 22nd annual international conference on Supercomputing
On sorting and load balancing on GPUs

ACM SIGARCH Computer Architecture News
GPU-Quicksort: A practical Quicksort algorithm for graphics processors

Journal of Experimental Algorithmics (JEA)
Efficient stream compaction on wide SIMD many-core architectures

Proceedings of the Conference on High Performance Graphics 2009
Fast minimum spanning tree for large graphs on the GPU

Proceedings of the Conference on High Performance Graphics 2009
Designing efficient sorting algorithms for manycore GPUs

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Fast in-place sorting with CUDA based on bitonic sort

PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Static GPU threads and an improved scan algorithm

Euro-Par 2010 Proceedings of the 2010 conference on Parallel processing
Scalable GPU graph traversal

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
A Novel Parallel Scan for Multicore Processors and Its Application in Sparse Matrix-Vector Multiplication

IEEE Transactions on Parallel and Distributed Systems
GPURoofline: a model for guiding performance optimizations on GPUs

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing

yaSpMV: yet another SpMV framework on GPUs

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

Quantified Score

Hi-index	0.00

Visualization

Abstract

Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. Current state of the art of GPU based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) global memory accesses. In this paper we propose StreamScan, a novel approach to implement scan on GPUs with only one computation phase. The main idea is to restrict synchronization to only adjacent workgroups, and thereby eliminating global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important op-timizations to further boost performance speedups, namely thread grouping to eliminate unnecessary local barriers, and register optimization to expand the on chip problem size. We designed an auto-tuning framework to search the parameter space automatically to generate highly optimized codes for both AMD and Nvidia GPUs. We implemented our technique with OpenCL. Compared with previous fast scan implementations, experimental results not only show promising performance speedups, but also reveal dramatic different optimization tradeoffs between Nvidia and AMD GPU platforms.