Programs developed under the Compute Unified Device Architecture (CUDA) achieve their highest performance when the hardware resources of the Graphics Processing Unit (GPU) are fully exploited. This requires both load balancing among threads and a high processor occupancy, i.e., a high ratio of active threads. In certain applications, however, an optimally balanced implementation may limit occupancy because it demands more registers and shared memory. This is the case for the Fast Generalized Hough Transform (Fast GHT), an image-processing technique for localizing an object within an image. In this work, we present two parallelization alternatives for the Fast GHT: one that optimizes load balancing and another that maximizes occupancy. We compare them on a large set of real images to identify their strong and weak points, and we draw several conclusions about the conditions under which each alternative is preferable. We also tackle several parallelization problems related to sparse data distribution, divergent execution paths, and irregular memory access patterns in updating operations by proposing a set of generic techniques: compacting, sorting, and memory storage replication. Finally, we compare our Fast GHT with the classic GHT, both running on a current GPU, obtaining a significant speed-up.
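Compacting, one of the generic techniques the abstract names for sparse data, is typically built on an exclusive prefix sum (scan): each kept element's output index is the count of kept elements before it. The sketch below illustrates this scan-then-scatter pattern in plain Python; on a GPU the scan and scatter would each be a parallel kernel. The function names and the non-zero-vote filter are illustrative assumptions, not details from the paper.

```python
def exclusive_scan(flags):
    """Exclusive prefix sum of 0/1 flags; each kept item's output index."""
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out, total

def compact(items, keep):
    """Scan-then-scatter stream compaction: pack flagged items densely."""
    flags = [1 if keep(x) else 0 for x in items]
    idx, count = exclusive_scan(flags)
    dense = [None] * count
    for x, f, i in zip(items, flags, idx):
        if f:                 # scatter kept items to their scanned index
            dense[i] = x
    return dense

# e.g. discard empty entries of a sparse vote array before processing
print(compact([0, 7, 0, 0, 3, 5, 0], lambda v: v != 0))  # -> [7, 3, 5]
```

Because every output index comes from the scan, the scatter step has no ordering dependence between elements, which is what makes the pattern map well onto many GPU threads.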