GPU Performance Enhancement via Communication Cost Reduction: Case Studies of Radix Sort and WSN Relay Node Placement Problem

Authors:
Che-Rung Lee;Shih-Hsiang Lo;Nan-Hsi Chen;Yeh-Ching Chung;I-Hsin Chung
Venue:
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Year:
2012

Abstract

As the computational power of Graphics Processing Units (GPUs) increases, data transmission becomes the major performance bottleneck. In this study, we investigate two techniques, data streaming and data compression, to reduce the communication cost on GPUs. Data streaming enables communication to overlap with computation, whereas data compression reduces the size of the data transferred among different memory spaces. Although both techniques increase the computation cost, overall performance can still improve because the communication cost is reduced. We demonstrate the effectiveness of the two techniques via two case studies: radix sort and 3-star, a deployment algorithm for wireless sensor networks. For radix sort, we present a new algorithm that mixes the MSD and LSD algorithms and employs data streaming. It is 25% faster than the fastest GPU radix sort implementation currently available in the public domain. For the 3-star algorithm, the GPU implementation is several hundred times faster than the CPU code. Data streaming and data compression, applied in a hybrid CPU-GPU algorithm, provide an additional 54% performance improvement over the GPU implementation. Data compression not only reduces the communication cost but also shortens the computation time, which yields a further performance gain.
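
The abstract does not spell out how the MSD and LSD passes are combined, so the following is only a minimal CPU-side sketch of one plausible arrangement, not the authors' GPU algorithm. It assumes that a single MSD pass on the top bits splits the keys into independent chunks, and that each chunk is then sorted by LSD passes on the remaining bits. On a GPU, such chunks could in principle be streamed so that the transfer of one chunk overlaps with the sorting of another; that overlap is not modeled here. The names `hybrid_radix_sort` and `lsd_radix_sort` and the parameters `msd_bits` and `digit_bits` are hypothetical.

```python
from typing import List


def lsd_radix_sort(keys: List[int], total_bits: int, digit_bits: int = 8) -> List[int]:
    """Stable LSD radix sort of non-negative ints, least significant digit first."""
    mask = (1 << digit_bits) - 1
    for shift in range(0, total_bits, digit_bits):
        buckets = [[] for _ in range(1 << digit_bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        # Concatenating buckets in order keeps the pass stable.
        keys = [k for bucket in buckets for k in bucket]
    return keys


def hybrid_radix_sort(keys: List[int], key_bits: int = 32,
                      msd_bits: int = 4, digit_bits: int = 8) -> List[int]:
    """Sketch: one MSD pass into chunks, then LSD sort inside each chunk.

    Assumes every key satisfies 0 <= key < 2**key_bits.
    """
    low_bits = key_bits - msd_bits
    # MSD pass: the top msd_bits select the chunk, so chunks are already
    # ordered relative to each other and can be processed independently.
    chunks = [[] for _ in range(1 << msd_bits)]
    for k in keys:
        chunks[k >> low_bits].append(k)
    # LSD passes: sort each chunk on the remaining low bits.
    result: List[int] = []
    for chunk in chunks:
        result.extend(lsd_radix_sort(chunk, low_bits, digit_bits))
    return result
```

Because the chunks produced by the MSD pass never exchange keys, each one is a self-contained unit of work, which is the property that would make chunk-wise streaming possible under the assumption stated above.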