A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures

  • Authors:
  • Mehmet E. Belviranli; Laxmi N. Bhuyan; Rajiv Gupta

  • Affiliations:
  • University of California, Riverside; University of California, Riverside; University of California, Riverside

  • Venue:
  • ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
  • Year:
  • 2013

Abstract

Today's heterogeneous architectures bring together multiple general-purpose CPUs and multiple domain-specific GPUs and FPGAs to provide dramatic speedup for many applications. However, the challenge lies in utilizing these heterogeneous processors to optimize overall application performance by minimizing workload completion time. Operating system and application development for these systems are still in their infancy. In this article, we propose a new scheduling and workload balancing scheme, HDSS, for the execution of loops with dependent or independent iterations on heterogeneous multiprocessor systems. The new algorithm dynamically learns the computational power of each processor during an adaptive phase and then schedules the remainder of the workload using a weighted self-scheduling scheme during the completion phase. Unlike previous studies, our scheme considers the runtime effects of block sizes on performance for heterogeneous multiprocessors. It finds the right trade-off between large and small block sizes to maintain a balanced workload while keeping accelerator utilization at a maximum. Our algorithm does not require offline training or architecture-specific parameters. We have evaluated our scheme on two different heterogeneous architectures: an AMD 64-core Bulldozer system with an nVidia Fermi C2050 GPU and a 32-core Intel Xeon SGI Altix 4700 supercomputer with Xilinx Virtex 4 FPGAs. The experimental results show that our new scheduling algorithm can achieve performance improvements of over 200% compared to the closest existing load balancing scheme. Our algorithm also achieves full processor utilization, with all processors completing at nearly the same time, which is significantly better than current alternative approaches.
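
To make the two-phase idea concrete, the following is a minimal, hypothetical Python sketch of a dynamic self-scheduler in the spirit of the abstract: an adaptive phase that measures each processor's throughput on small probe blocks, followed by a completion phase that hands out blocks sized in proportion to the measured speeds. It is not the authors' HDSS implementation; the function names (`hdss_schedule`, `run_block`), the probe fraction, and the block-shrinking factor are all illustrative assumptions, and the loop is serialized here although real processors would pull blocks concurrently.

```python
import time

def hdss_schedule(iterations, processors, run_block, adaptive_fraction=0.1):
    """Illustrative two-phase dynamic self-scheduling sketch (not the paper's code).

    iterations  -- total number of loop iterations to distribute
    processors  -- list of processor identifiers (e.g. ["cpu0", "gpu0"])
    run_block   -- callable run_block(proc, start, size) executing `size`
                   iterations beginning at `start` on `proc` (assumed interface)
    """
    remaining, next_iter, speed = iterations, 0, {}

    # Adaptive phase: give every processor an equal small probe block and
    # record its throughput (iterations per second).
    probe = max(1, int(iterations * adaptive_fraction) // len(processors))
    for proc in processors:
        size = min(probe, remaining)
        if size == 0:
            break
        start = time.perf_counter()
        run_block(proc, next_iter, size)
        speed[proc] = size / (time.perf_counter() - start)
        next_iter += size
        remaining -= size

    # Completion phase: weighted self-scheduling. Each processor grabs a block
    # proportional to its measured speed; blocks shrink as the loop nears its
    # end so that all processors finish at nearly the same time.
    total_speed = sum(speed.values()) or 1.0
    while remaining > 0:
        for proc in processors:
            if remaining == 0:
                break
            share = speed.get(proc, total_speed / len(processors)) / total_speed
            size = min(remaining, max(1, int(remaining * share * 0.5)))
            run_block(proc, next_iter, size)
            next_iter += size
            remaining -= size
```

In a real system the completion phase would be driven by idle processors requesting their next block from a shared counter (self-scheduling) rather than by a central loop; the sketch above only illustrates how the per-processor weights and shrinking block sizes interact.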