Enabling task-level scheduling on heterogeneous platforms

Authors:
Enqiang Sun;Dana Schaa;Richard Bagley;Norman Rubin;David Kaeli
Affiliations:
Northeastern University, Boston, MA;Northeastern University, Boston, MA;Advanced Micro Devices, Boxborough, MA;Advanced Micro Devices, Boxborough, MA;Northeastern University, Boston, MA
Venue:
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Year:
2012

Abstract

OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units within a system. OpenCL is the first standard that focuses on portability, allowing programs to be written once and run seamlessly on multiple heterogeneous devices, regardless of vendor. Although OpenCL has been widely adopted, it still lacks support for automatic task scheduling and data consistency when multiple devices are present in the system. To address this need, we have designed a task queueing extension for OpenCL that provides a high-level, unified execution model tightly coupled with a resource management facility. The main motivation for this extension is to give OpenCL programmers a convenient programming paradigm that makes full use of every available device in a system and accommodates flexible scheduling schemes. To demonstrate the value and utility of the extension, we apply it to clSURF, an advanced OpenCL-based imaging toolkit. Using our task queueing extension, we demonstrate the potential performance opportunities and limitations given current vendor implementations of OpenCL. Compared with a state-of-the-art implementation on a single GPU device as the baseline, our task queueing extension achieves a speedup of up to 72.4%. The extension also achieves scalable performance gains on multiple heterogeneous GPU devices. We also evaluate the performance trade-offs of using the host CPU as an accelerator.
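The general idea of task-level scheduling across heterogeneous devices can be illustrated with a small sketch. This is not the extension described in the abstract and it does not call OpenCL; it is a generic greedy list scheduler written under stated assumptions. The names `Device`, `Task`, `schedule`, the `speed` and `work` units, and the device labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    speed: float              # hypothetical relative throughput
    busy_until: float = 0.0   # time at which the device becomes free


@dataclass
class Task:
    name: str
    work: float                                # hypothetical amount of work
    deps: list = field(default_factory=list)   # names of prerequisite tasks


def schedule(tasks, devices):
    """Greedy list scheduling: place each ready task on the device
    that would finish it earliest, honoring task dependencies."""
    finish = {}      # task name -> finish time
    placement = {}   # task name -> device name
    pending = list(tasks)
    while pending:
        ready = [t for t in pending if all(d in finish for d in t.deps)]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        task = ready[0]
        # A task cannot start before all of its prerequisites finish.
        earliest_start = max((finish[d] for d in task.deps), default=0.0)

        def done_at(dev):
            return max(dev.busy_until, earliest_start) + task.work / dev.speed

        best = min(devices, key=done_at)
        best.busy_until = done_at(best)
        finish[task.name] = best.busy_until
        placement[task.name] = best.name
        pending.remove(task)
    return placement, finish


if __name__ == "__main__":
    devices = [Device("gpu0", speed=4.0), Device("gpu1", speed=2.0)]
    tasks = [
        Task("a", work=8.0),
        Task("b", work=4.0),
        Task("c", work=4.0, deps=["a", "b"]),
    ]
    placement, finish = schedule(tasks, devices)
    print(placement)
    print(finish)
```

In this toy setup the two independent tasks land on different devices and the dependent task waits for both. A real runtime would additionally have to account for data transfers between devices, which this sketch ignores.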