Optimizing convolution operations on GPUs using adaptive tiling

Authors:
Ben Van Werkhoven; Jason Maassen; Henri E. Bal; Frank J. Seinstra
Venue:
Future Generation Computer Systems
Year:
2014

Abstract

The research domain of Multimedia Content Analysis (MMCA) considers all aspects of the automated extraction of knowledge from multimedia data. High-performance computing techniques are necessary to satisfy the ever-increasing computational demands of MMCA applications. The introduction of Graphics Processing Units (GPUs) in modern cluster systems presents application developers with a challenge: while GPUs are well known to be capable of providing significant performance improvements, they also vastly increase programming complexity. To address this, we have extended a user-transparent parallel programming model for MMCA, named Parallel-Horus, to allow the execution of compute-intensive operations on the GPUs present in the cluster. The most important class of operations in the MMCA domain is convolution, which is typically responsible for a large fraction of the execution time. Existing optimization approaches, both those for CUDA kernels in general and those specific to convolution operations, are too limited in performance and flexibility. In this paper, we present a new optimization approach, called adaptive tiling, to implement a highly efficient, yet flexible, library-based convolution operation for modern GPUs. To the best of our knowledge, our implementation is the most optimized and best-performing implementation of 2D convolution in the spatial domain available to date.
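
For readers unfamiliar with the operation, the following is a minimal sketch of 2D convolution in the spatial domain, written first as a direct loop nest and then restructured so that each unit of work produces a small tile of outputs. It is illustrative only. The function names, the fixed tile size parameters, and the use of plain Python on the CPU are assumptions made for this sketch; it is not the adaptive tiling method or the CUDA implementation described in the paper.

```python
def convolve2d_valid(image, kernel):
    """Direct spatial-domain 2D convolution over the 'valid' region (no padding)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    # Index the kernel in reverse so this is a true
                    # convolution rather than a correlation.
                    acc += image[y + j][x + i] * kernel[kh - 1 - j][kw - 1 - i]
            out[y][x] = acc
    return out


def convolve2d_tiled(image, kernel, tile_h=2, tile_w=2):
    """Same result as convolve2d_valid, computed one output tile at a time.

    Hypothetical illustration: each (ty, tx) iteration stands in for one
    unit of work that produces a tile_h x tile_w block of outputs, so each
    filter weight is fetched once and applied to every output in the tile.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for ty in range(0, oh, tile_h):
        for tx in range(0, ow, tile_w):
            # Clamp the tile at the image border.
            y_end = min(ty + tile_h, oh)
            x_end = min(tx + tile_w, ow)
            for j in range(kh):
                for i in range(kw):
                    weight = kernel[kh - 1 - j][kw - 1 - i]
                    for y in range(ty, y_end):
                        for x in range(tx, x_end):
                            out[y][x] += image[y + j][x + i] * weight
    return out


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, -1]]
    print(convolve2d_valid(image, kernel))
    print(convolve2d_tiled(image, kernel))
```

Both functions compute the same values. The tiled variant only changes the loop order so that a block of neighbouring outputs is produced together, and the tile size is simply a caller-supplied parameter here.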