Performance analysis of accelerated image registration using GPGPU

Authors: Peter Bui; Jay Brockman
Affiliations: University of Notre Dame; University of Notre Dame
Venue: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Year: 2009

Abstract

This paper presents a performance analysis of an accelerated 2-D rigid image registration implementation that uses the Compute Unified Device Architecture (CUDA) programming environment to exploit the parallel processing capabilities of NVIDIA's Tesla C870 GPU. We explain the underlying structure of the GPU implementation and compare its performance and accuracy against a fast CPU-based implementation. Our experimental results demonstrate that our GPU version achieves speedups of up to 90x with bilinear interpolation and up to 30x with bicubic interpolation while maintaining a high level of accuracy. This compares favorably to recent image registration studies, but it also indicates that our implementation reaches only about 70% of theoretical peak performance. To analyze our results, we use profiling data to identify some of the underlying limitations of CUDA that prevent the implementation from reaching peak performance. Finally, we emphasize the need to manage memory resources carefully in order to fully utilize the GPU and obtain maximum speedup.
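The abstract does not show the implementation, so the following is only an illustrative CPU-side sketch of the per-pixel work that a 2-D rigid transform with bilinear interpolation involves. It is not the authors' CUDA code. The function names `bilinear_sample` and `rigid_transform`, the choice to rotate about the image centre, and the zero fill outside the image are all assumptions made for this example.

```python
import math

def bilinear_sample(img, x, y):
    """Sample img (a list of rows) at real-valued (x, y).

    Assumption for this sketch: points outside the image return 0.0.
    """
    h, w = len(img), len(img[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return 0.0
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bottom

def rigid_transform(img, theta, tx, ty):
    """Resample img under a rotation by theta (radians) about the image
    centre followed by a translation (tx, ty).

    Each output pixel depends only on its own coordinates, which is the
    kind of independent per-pixel work that maps onto one GPU thread.
    """
    h, w = len(img), len(img[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    c, s = math.cos(theta), math.sin(theta)
    out = [[0.0] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            # Inverse mapping: find the source location of output pixel (u, v).
            dx, dy = u - cx - tx, v - cy - ty
            x = c * dx + s * dy + cx
            y = -s * dx + c * dy + cy
            out[v][u] = bilinear_sample(img, x, y)
    return out
```

For example, `rigid_transform(img, 0.0, 1.0, 0.0)` shifts the image one pixel to the right and fills the vacated column with zeros. Bicubic interpolation would replace the 2x2 neighbourhood in `bilinear_sample` with a 4x4 one, so it reads more memory and does more arithmetic per pixel.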