Over the last decade, research and development of massively parallel commodity graphics hardware has grown dramatically in both academia and industry. Graphics card architectures are a well-suited platform for the parallel execution of many number-crunching loop programs from fields such as image processing and linear algebra. However, mapping such algorithms efficiently to graphics hardware is hard, even with detailed insight into the architecture. This paper presents a multiresolution image processing algorithm and shows how this type of algorithm can be mapped efficiently to graphics hardware. Furthermore, the impact of the execution configuration is illustrated, and a method is proposed to determine the best configuration offline so that it can be used at run-time. Using CUDA as the programming model, it is demonstrated that the image processing algorithm is significantly accelerated: a speedup of up to 33x is achieved on NVIDIA's Tesla C870 compared to a parallelized implementation on a Xeon Quad Core.
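The abstract does not spell out how the best execution configuration is found, so the following is only a minimal host-side sketch of the general idea: enumerate the legal thread-block shapes offline, score each one, and keep the winner for use at run-time. The function names (`candidate_blocks`, `grid_for`, `best_configuration`) and the `measure` callback are hypothetical and not taken from the paper. The limit of 512 threads per block and the warp size of 32 are the values for compute capability 1.x devices such as the Tesla C870. In a real setup, `measure` would launch the kernel with the given configuration and return its measured run time.

```python
WARP_SIZE = 32               # threads per warp on NVIDIA GPUs
MAX_THREADS_PER_BLOCK = 512  # limit on compute capability 1.x devices (e.g. Tesla C870)


def candidate_blocks(max_threads=MAX_THREADS_PER_BLOCK, warp=WARP_SIZE):
    """Yield 2D block shapes (bx, by) whose thread count is a multiple of the warp size."""
    for threads in range(warp, max_threads + 1, warp):
        for by in range(1, threads + 1):
            if threads % by == 0:
                yield threads // by, by


def grid_for(width, height, block):
    """Number of blocks in x and y needed to cover a width x height image."""
    bx, by = block
    return (width + bx - 1) // bx, (height + by - 1) // by


def best_configuration(width, height, measure):
    """Return the block shape with the smallest cost reported by `measure`.

    `measure(block, grid)` is an assumed callback; offline it would time a
    kernel launch with that configuration and return the elapsed time.
    """
    return min(candidate_blocks(),
               key=lambda block: measure(block, grid_for(width, height, block)))
```

The result of `best_configuration` can be stored per image size (for example in a small lookup table) so that no exploration is needed at run-time. Restricting candidates to warp multiples keeps the search space small; whether the paper prunes the space the same way is not stated in the abstract.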