Hybrid systems combining CPUs and GPUs have become the new standard in high-performance computing. To exploit data parallelism on such systems, a workload is split into two parts and distributed across the devices so that both CPU and GPU are utilized. However, balancing the workload between CPU and GPU manually is difficult, because GPU performance is sensitive to problem size. Current dynamic schedulers therefore rebalance the workload between CPU and GPU periodically. Each periodic rebalancing step requires a CPU-GPU synchronization, and these frequent synchronizations often degrade overall performance. To address this problem, we propose CAP, a Co-scheduling strategy based on Asymptotic Profiling. CAP dynamically splits a task's workload between CPU and GPU and uses profiling to predict the workload of the next partition. It is tailored to the GPU's performance characteristics and balances the load between CPU and GPU with only a few synchronizations. We evaluate our proof-of-concept system on four benchmarks, and the results show that CAP improves performance by up to 45.1% over the state-of-the-art co-scheduling strategy.
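The core idea of asymptotic profiling can be illustrated with a minimal sketch: partition sizes grow geometrically, so the scheduler needs only O(log n) synchronization points, while each partition is split between the devices in proportion to their most recently measured throughputs. All names and parameters below (`run_cpu`, `run_gpu`, `first_chunk`, `growth`) are hypothetical illustrations, not the paper's actual implementation.

```python
def cap_schedule(total_work, run_cpu, run_gpu, first_chunk=0.05, growth=2.0):
    """Sketch of asymptotic-profiling co-scheduling (hypothetical API).

    run_cpu / run_gpu take a number of work items, execute them on the
    respective device, and return the elapsed time in seconds.  Returns
    the number of CPU-GPU synchronizations performed.
    """
    remaining = total_work
    chunk = max(1, int(total_work * first_chunk))  # small initial partition
    cpu_rate = gpu_rate = 1.0                      # start with an even split
    syncs = 0
    while remaining > 0:
        chunk = min(chunk, remaining)
        # Split this partition in proportion to the measured throughputs.
        gpu_share = int(chunk * gpu_rate / (cpu_rate + gpu_rate))
        cpu_share = chunk - gpu_share
        t_cpu = run_cpu(cpu_share) if cpu_share else 0.0
        t_gpu = run_gpu(gpu_share) if gpu_share else 0.0
        # Refine the throughput estimates from the observed timings.
        if cpu_share and t_cpu > 0:
            cpu_rate = cpu_share / t_cpu
        if gpu_share and t_gpu > 0:
            gpu_rate = gpu_share / t_gpu
        remaining -= chunk
        chunk = int(chunk * growth)  # asymptotically larger partitions
        syncs += 1                   # one CPU-GPU synchronization per round
    return syncs
```

With a doubling `growth` factor, 1,000 work items on a GPU ten times faster than the CPU finish after only five synchronization rounds, in contrast to a fixed-chunk dynamic scheduler, which would synchronize once per chunk throughout the run.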