Gather and scatter are two fundamental data-parallel operations, in which a large number of data items are read (gathered) from, or written (scattered) to, given locations. In this paper, we study these two operations on graphics processing units (GPUs). With superior computing power and high memory bandwidth, GPUs have become a commodity multiprocessor platform for general-purpose high-performance computing. However, due to the random-access nature of gather and scatter, a naive implementation of the two operations suffers from low utilization of memory bandwidth and, consequently, a long, unhidden memory latency. Additionally, architectural details of GPUs, in particular the memory hierarchy design, are not fully disclosed to programmers. Therefore, we design multi-pass gather and scatter operations to improve their data access locality, and develop a performance model to help understand and optimize these two operations. We have evaluated our algorithms on sorting, hashing, and sparse matrix-vector multiplication in comparison with their optimized CPU counterparts. Our results show that these optimizations yield a 2--4X improvement in GPU bandwidth utilization and a 30--50% improvement in response time. Overall, our optimized GPU implementations are 2--7X faster than their optimized CPU counterparts.
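To make the two primitives and the multi-pass idea concrete, the following is a minimal, CPU-side Python sketch. The `multipass_gather` function is only an illustration of the locality principle the paper describes (each pass touches one contiguous chunk of the source array, so reads within a pass stay local); it is not the paper's exact GPU algorithm, and the function names and chunking scheme are our own.

```python
def gather(values, indices):
    """Gather: out[i] = values[indices[i]] -- reads from arbitrary locations."""
    return [values[i] for i in indices]


def scatter(values, indices, size):
    """Scatter: out[indices[i]] = values[i] -- writes to arbitrary locations."""
    out = [0] * size
    for v, i in zip(values, indices):
        out[i] = v
    return out


def multipass_gather(values, indices, num_passes=4):
    """Multi-pass gather sketch (illustrative, not the paper's algorithm):
    each pass serves only the indices falling in one contiguous chunk of
    `values`, so the reads issued within a pass are confined to a small,
    cache-friendly region instead of the whole array."""
    out = [None] * len(indices)
    chunk = -(-len(values) // num_passes)  # ceiling division
    for p in range(num_passes):
        lo, hi = p * chunk, min((p + 1) * chunk, len(values))
        for j, i in enumerate(indices):
            if lo <= i < hi:  # index lands in this pass's chunk
                out[j] = values[i]
    return out
```

A naive single-pass gather touches the entire source array in random order; the multi-pass variant trades extra passes over the index array for bounded working-set size per pass, which is the trade-off the paper's performance model is meant to capture.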