Scan and segmented scan are important data-parallel primitives for a wide range of applications. We present fast, work-efficient algorithms for these primitives on graphics processing units (GPUs). We use novel data representations that map well to the GPU architecture. Our algorithms exploit shared memory to improve memory performance, and we improve performance further by eliminating shared-memory bank conflicts and reducing the overheads present in prior shared-memory GPU algorithms. Our algorithms are designed to work well on general data sets, including segmented arrays with arbitrary segment lengths. We also present optimizations that improve the performance of segmented scans based on the segment lengths. We implemented our algorithms on a PC with an NVIDIA GeForce 8800 GPU and compared our results with prior GPU-based algorithms. Our results indicate up to 10x higher performance than prior algorithms on input sequences with millions of elements.
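For readers unfamiliar with the two primitives, the following is a minimal sequential sketch in Python of what they compute. It is not the GPU implementation described in the abstract. The function names (`exclusive_scan`, `segmented_inclusive_scan`), the head-flag representation of segments, and the padding to a power of two are assumptions made for illustration. The first function follows the well-known balanced-tree (up-sweep/down-sweep) scheme often used to explain work-efficient scans; the second shows that a segmented scan can be expressed as an ordinary scan over (flag, value) pairs.

```python
from itertools import accumulate
import operator


def exclusive_scan(values, op=operator.add, identity=0):
    """Exclusive scan using a balanced-tree up-sweep and down-sweep.

    Simulated sequentially; on a parallel machine each inner loop
    would run as one parallel step. Performs O(n) applications of op.
    """
    n = len(values)
    if n == 0:
        return []
    # Assumption: pad to a power of two so the tree is complete.
    size = 1
    while size < n:
        size *= 2
    a = list(values) + [identity] * (size - n)

    # Up-sweep: accumulate partial results toward the last element.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            a[i] = op(a[i - stride], a[i])
        stride *= 2

    # Down-sweep: distribute prefixes back down the tree.
    a[size - 1] = identity
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = a[i - stride]
            a[i - stride] = a[i]
            a[i] = op(a[i], left)
        stride //= 2
    return a[:n]


def segmented_inclusive_scan(values, head_flags, op=operator.add):
    """Inclusive scan that restarts wherever head_flags is truthy.

    Implemented as a plain scan over (flag, value) pairs with a
    modified associative operator.
    """
    def seg_op(left, right):
        left_flag, left_val = left
        right_flag, right_val = right
        # A head flag on the right element discards the running value.
        combined = right_val if right_flag else op(left_val, right_val)
        return (left_flag or right_flag, combined)

    pairs = zip(head_flags, values)
    return [val for _, val in accumulate(pairs, seg_op)]


print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
print(segmented_inclusive_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1]))
```

In the second call the flags mark three segments, `[1, 2, 3]`, `[4, 5]` and `[6]`, so segment lengths are arbitrary and the running sum restarts at each flagged position.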