Computer Methods and Programs in Biomedicine
As the complexity of neuroimaging algorithms and the resolution of imaging technology continue to grow faster than CPU performance, data-parallel computing methods will become increasingly important. The high-performance, data-parallel architecture of modern graphics processing units (GPUs) can reduce computation times by orders of magnitude. However, their massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms on the CUDA architecture. For compute-bound algorithms, register usage is reduced through variable reuse via shared memory, and data throughput is increased by giving each thread a heavier workload and by maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, the data is fitted into the fast but limited GPU resources by reorganizing it into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance is optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that run up to 6x faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129x for the 3D unbiased nonlinear image registration technique and 93x for the non-local means surface denoising algorithm.
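The resource arithmetic behind two of these strategies can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: the function names are hypothetical, and the default hardware limits (16384 registers per multiprocessor, 512 threads per block, warps of 32 threads) are assumed values typical of early CUDA devices, so they should be replaced with the limits of the actual target GPU. The sketch also ignores register allocation granularity, which real hardware applies.

```python
def max_threads_single_block(regs_per_thread, regs_per_sm=16384,
                             max_threads_per_block=512, warp_size=32):
    """Largest warp-aligned thread count for ONE block per multiprocessor.

    All default limits are assumptions for illustration; query the real
    device for its actual values.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    # Threads that fit in the register file if a single block owns all of it.
    by_registers = regs_per_sm // regs_per_thread
    # The block-size limit of the device caps the result as well.
    limit = min(by_registers, max_threads_per_block)
    # Round down to a whole number of warps.
    return (limit // warp_size) * warp_size


def passes_needed(total_bytes, budget_bytes):
    """Number of passes when data is split into chunks of at most budget_bytes."""
    if budget_bytes <= 0:
        raise ValueError("budget_bytes must be positive")
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    # Integer ceiling division, avoiding floating-point rounding.
    return (total_bytes + budget_bytes - 1) // budget_bytes


if __name__ == "__main__":
    # Fewer registers per thread allow a larger single-block configuration.
    for regs in (64, 40, 32):
        print(regs, "registers/thread ->", max_threads_single_block(regs), "threads")
    # A data set larger than the fast-memory budget needs several passes.
    print(passes_needed(100 * 2**20, 30 * 2**20), "passes")
```

Under these assumed limits, lowering per-thread register usage directly raises the number of threads a single block can run, which is the motivation for moving reusable variables into shared memory. The pass count shows why the data must be reorganized into self-contained chunks: each pass has to be processable without touching data left outside the budget.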