Commodity cluster-based parallel processing of hyperspectral imagery
Journal of Parallel and Distributed Computing
FFT and Convolution Performance in Image Filtering on GPU
IV '06 Proceedings of the conference on Information Visualization
The Semantic Pathfinder: Using an Authoring Metaphor for Generic Multimedia Indexing
IEEE Transactions on Pattern Analysis and Machine Intelligence
Program optimization space pruning for a multithreaded gpu
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Roofline: an insightful visual performance model for multicore architectures
Communications of the ACM - A Direct Path to Dependable Software
A Grid framework to enable parallel and concurrent TMA image analyses
International Journal of Grid and Utility Computing
A GPGPU compiler for memory optimization and parallelism management
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Accelerated 2d image processing on GPUs
ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part II
On the use of small 2d convolutions on GPUs
ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Hi-index | 0.00 |
The research domain of Multimedia Content Analysis (MMCA) considers all aspects of the automated extraction of knowledge from multimedia data. High-performance computing techniques are necessary to satisfy the ever increasing computational demands of MMCA applications. The introduction of Graphics Processing Units (GPUs) in modern cluster systems presents application developers with a challenge. While GPUs are well known to be capable of providing significant performance improvements, the programming complexity vastly increases. To this end, we have extended a user transparent parallel programming model for MMCA, named Parallel-Horus, to allow the execution of compute intensive operations on the GPUs present in the cluster. The most important class of operations in the MMCA domain are convolutions, which are typically responsible for a large fraction of the execution time. Existing optimization approaches for CUDA kernels in general as well as those specific to convolution operations are too limited in both performance and flexibility. In this paper, we present a new optimization approach, called adaptive tiling, to implement a highly efficient, yet flexible, library-based convolution operation for modern GPUs. To the best of our knowledge, our implementation is the most optimized and best performing implementation of 2D convolution in the spatial domain available to date.