An optimal parallel prefix-sums algorithm on the memory machine models for GPUs

Authors:
Koji Nakano
Affiliations:
Department of Information Engineering, Hiroshima University, Higashi Hiroshima, Japan
Venue:
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Year:
2012

Abstract

The main contribution of this paper is to show optimal algorithms for computing the sum and the prefix-sums on two memory machine models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). The DMM and the UMM are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs, respectively. These models have three parameters: the number $p$ of threads, the width $w$ of the memory, and the memory access latency $l$. We first show that the sum of $n$ numbers can be computed in $O\left(\frac{n}{w}+\frac{nl}{p}+l\log n\right)$ time units on the DMM and the UMM. We then show that $\Omega\left(\frac{n}{w}+\frac{nl}{p}+l\log n\right)$ time units are necessary to compute the sum. Finally, we show an optimal parallel algorithm that computes the prefix-sums of $n$ numbers in $O\left(\frac{n}{w}+\frac{nl}{p}+l\log n\right)$ time units on the DMM and the UMM.
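
To make the problem statement and the bound concrete, here is a minimal Python sketch. It is not the algorithm from the paper: `blocked_prefix_sums` is a generic three-phase blocked scan, simulated sequentially, and `bound_terms` only evaluates the three terms of the stated bound with constant factors ignored. All function names are hypothetical, the prefix-sums are assumed to be inclusive, and the logarithm is assumed to be base 2.

```python
from itertools import accumulate
from math import log2


def prefix_sums(values):
    """Sequential reference: out[i] = values[0] + ... + values[i]."""
    return list(accumulate(values))


def blocked_prefix_sums(values, p):
    """Generic three-phase blocked scan, simulating p threads sequentially.

    Phase 1: each simulated thread sums its own contiguous block.
    Phase 2: the block sums are turned into exclusive per-block offsets.
    Phase 3: each simulated thread scans its block, starting from its offset.
    """
    n = len(values)
    if n == 0:
        return []
    block = -(-n // p)  # ceil(n / p) elements per simulated thread
    blocks = [values[i:i + block] for i in range(0, n, block)]
    block_sums = [sum(b) for b in blocks]
    offsets = [0] + list(accumulate(block_sums))[:-1]
    out = []
    for b, off in zip(blocks, offsets):
        out.extend(off + s for s in accumulate(b))
    return out


def bound_terms(n, w, p, l):
    """The three terms of n/w + n*l/p + l*log2(n), constants ignored."""
    return {"n/w": n / w, "nl/p": n * l / p, "l log n": l * log2(n)}
```

For example, `bound_terms(1024, 32, 512, 4)` gives 32 for the bandwidth term $n/w$, 8 for the latency term $nl/p$, and 40 for the reduction term $l\log n$, so for these illustrative parameter values the $l\log n$ term is the largest.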