CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms

  • Authors:
  • Daren Lee, Ivo Dinov, Bin Dong, Boris Gutman, Igor Yanovsky, Arthur W. Toga

  • Affiliations:
  • Laboratory of Neuro Imaging, David Geffen School of Medicine, UCLA, 635 Charles Young Drive South, Suite 225, Los Angeles, CA 90095, USA (Lee, Dinov, Gutman, Toga)
  • Department of Mathematics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA (Dong)
  • Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109, USA (Yanovsky)

  • Venue:
  • Computer Methods and Programs in Biomedicine
  • Year:
  • 2012

Abstract

As neuroimaging algorithms and imaging technology continue to grow in complexity and image resolution faster than CPU performance, data-parallel computing methods will be increasingly important. The high-performance, data-parallel architecture of modern graphics processing units (GPUs) can reduce computation times by orders of magnitude. However, their massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms on the CUDA architecture. For compute-bound algorithms, register usage is reduced through variable reuse via shared memory, and data throughput is increased through heavier thread workloads and a thread configuration that maximizes a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved by reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache behavior is optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6x faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129x for the 3D unbiased nonlinear image registration technique and 93x for the non-local means surface denoising algorithm.
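
The strategies summarized above can be made concrete with a short sketch. The following CUDA program is a minimal illustration, not code from the paper: the three-point stencil, the names stencil3, BLOCK, and WORK, and the filter coefficients are all assumptions for exposition. It combines three of the abstract's ideas: filter weights are staged once per block in shared memory rather than held in every thread's registers (variable reuse), each thread computes several outputs (heavier thread workload), and reads are routed through the read-only data cache, a memory resource matched to a read-only, spatially local access pattern.

```cuda
// Minimal sketch, assuming a 1D three-point stencil stands in for the
// paper's kernels; names and coefficients are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256   // large block, aiming toward one resident block per SM
#define WORK  4     // outputs per thread: the "heavier thread workload"

__global__ void stencil3(const float* __restrict__ in, float* out, int n)
{
    // Variable reuse via shared memory: the filter weights live once per
    // block instead of occupying registers in every thread.
    __shared__ float w[3];
    if (threadIdx.x < 3)
        w[threadIdx.x] = (threadIdx.x == 1) ? 0.5f : 0.25f;
    __syncthreads();

    // Thread coarsening: each thread produces WORK outputs, raising the
    // data throughput per launched thread.
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * WORK;
    for (int k = 0; k < WORK; ++k) {
        int i = base + k;
        if (i > 0 && i < n - 1) {
            // __ldg reads through the read-only data cache (compute
            // capability 3.5+), selecting a memory resource whose caching
            // suits this read-only, spatially local access pattern.
            out[i] = w[0] * __ldg(&in[i - 1])
                   + w[1] * __ldg(&in[i])
                   + w[2] * __ldg(&in[i + 1]);
        }
    }
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    int threads = (n + WORK - 1) / WORK;       // one thread per WORK outputs
    int blocks  = (threads + BLOCK - 1) / BLOCK;
    stencil3<<<blocks, BLOCK>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[100] = %f (expect 100.0)\n", out[100]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The right BLOCK value for the single-block-per-multiprocessor configuration the abstract describes is hardware-dependent and is best settled empirically, for example with the CUDA Occupancy Calculator.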