Automatic OpenCL device characterization: guiding optimized kernel design

Authors:
Peter Thoman;Klaus Kofler;Heiko Studt;John Thomson;Thomas Fahringer
Affiliations:
University of Innsbruck;University of Innsbruck;University of Innsbruck;University of Innsbruck;University of Innsbruck
Venue:
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Year:
2011



Abstract

The OpenCL standard allows targeting a large variety of CPU, GPU, and accelerator architectures through a single unified programming interface and language. While the standard guarantees functional portability for compliant applications and platforms, performance portability across such a diverse set of hardware is limited: devices may vary significantly in memory architecture as well as in the type, number, and complexity of their computational units. To characterize and compare the OpenCL performance of existing and future devices, we propose a suite of microbenchmarks, uCLbench. We present measurements for eight hardware architectures (four GPUs, three CPUs, and one accelerator) and illustrate how the results accurately reflect the unique characteristics of each platform. In addition to measuring quantities traditionally benchmarked on CPUs, such as arithmetic throughput or the bandwidth and latency of the various address spaces, the suite also includes code designed to determine parameters specific to OpenCL devices, such as the dynamic branching penalties prevalent on GPUs. Using an example kernel that represents the key computation of a linear multigrid solver, we demonstrate how our results can guide algorithm design and optimization for any given platform. Guided manual optimization of this kernel yields an average improvement of 61% across the eight platforms tested.
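
To illustrate why a dynamic branching penalty is worth measuring, the following is a minimal Python sketch of a simplified cost model. It is not part of uCLbench and does not reproduce the paper's method. It assumes that work-items execute in lockstep groups, and that a group whose work-items disagree on a branch must execute both paths one after the other. The function names, the group width of 32, and the cycle costs are hypothetical values chosen for illustration only.

```python
def group_branch_cost(takes_branch, cost_then, cost_else):
    """Cost of one lockstep group executing an if/else (simplified model).

    A path is paid for if at least one work-item in the group takes it.
    """
    any_then = any(takes_branch)
    any_else = not all(takes_branch)
    return (cost_then if any_then else 0) + (cost_else if any_else else 0)


def kernel_branch_cost(conditions, group_width, cost_then, cost_else):
    """Sum the modelled branch cost over all lockstep groups."""
    total = 0
    for start in range(0, len(conditions), group_width):
        group = conditions[start:start + group_width]
        total += group_branch_cost(group, cost_then, cost_else)
    return total


# Hypothetical parameters: 64 work-items, groups of 32, paths of 10 and 20 cycles.
coherent = [True] * 32 + [False] * 32          # each group agrees on the branch
divergent = [i % 2 == 0 for i in range(64)]    # neighbours disagree

print(kernel_branch_cost(coherent, 32, 10, 20))   # 30
print(kernel_branch_cost(divergent, 32, 10, 20))  # 60
```

In this model the same amount of useful work costs twice as much when the branch condition alternates between neighbouring work-items. A microbenchmark measures the size of this effect on real hardware; the model only shows why it can matter.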