The main contribution of this paper is to present optimal algorithms for computing the sum and the prefix-sums on two memory machine models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). The DMM and the UMM are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs, respectively. Both models have three parameters: the number $p$ of threads, the width $w$ of the memory, and the memory access latency $l$. We first show that the sum of $n$ numbers can be computed in $O({n\over w}+{nl\over p}+l\log n)$ time units on the DMM and the UMM. We then show that $\Omega({n\over w}+{nl\over p}+l\log n)$ time units are necessary to compute the sum, so this algorithm is optimal. Finally, we show an optimal parallel algorithm that computes the prefix-sums of $n$ numbers in $O({n\over w}+{nl\over p}+l\log n)$ time units on the DMM and the UMM.
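To make the three terms of the bound concrete, the following Python sketch evaluates each term separately and runs a generic pairwise (tree) reduction whose number of rounds grows as $\log n$. It is an illustration only, not the paper's algorithm: the function names `cost_terms` and `tree_sum` are hypothetical, constant factors are ignored, the logarithm is taken base 2 for convenience, and the code does not model memory banks, address groups, or the latency pipeline of the DMM or the UMM.

```python
import math


def cost_terms(n, w, p, l):
    """Return the three terms of n/w + n*l/p + l*log2(n) as a tuple.

    Hypothetical helper: constants hidden by the O-notation are ignored,
    and base 2 is an assumption made only for illustration.
    """
    return (n / w, n * l / p, l * math.log2(n))


def tree_sum(values):
    """Sum values by pairwise reduction; return (total, rounds).

    Each round adds adjacent pairs, so the number of rounds is
    ceil(log2(n)) for n >= 1. This is a sequential simulation of the
    reduction tree, not a parallel implementation.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Add adjacent pairs; an unpaired last element is carried over.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    total = level[0] if level else 0
    return total, rounds


if __name__ == "__main__":
    print(cost_terms(1024, 32, 256, 4))
    print(tree_sum(range(1, 9)))
```

For example, with $n = 1024$, $w = 32$, $p = 256$, and $l = 4$, the three terms evaluate to 32, 16, and 40, so under these assumed parameters the $l\log n$ term is the largest.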