Today's heterogeneous architectures bring together multiple general-purpose CPUs and multiple domain-specific GPUs and FPGAs to provide dramatic speedup for many applications. The challenge, however, lies in using these heterogeneous processors to optimize overall application performance by minimizing workload completion time. Operating system and application development for these systems is still in its infancy. In this article, we propose a new scheduling and workload balancing scheme, HDSS, for executing loops with dependent or independent iterations on heterogeneous multiprocessor systems. The new algorithm dynamically learns the computational power of each processor during an adaptive phase and then schedules the remainder of the workload using a weighted self-scheduling scheme during the completion phase. Unlike previous studies, our scheme considers the runtime effects of block sizes on performance for heterogeneous multiprocessors. It finds the right trade-off between large and small block sizes to keep the workload balanced while keeping accelerator utilization at its maximum. Our algorithm does not require offline training or architecture-specific parameters. We have evaluated our scheme on two different heterogeneous architectures: a 64-core AMD Bulldozer system with an NVIDIA Fermi C2050 GPU, and an Intel Xeon 32-core SGI Altix 4700 supercomputer with Xilinx Virtex-4 FPGAs. The experimental results show that our new scheduling algorithm can achieve performance improvements of more than 200% in the best case when compared to the closest existing load balancing scheme. Our algorithm also achieves full processor utilization, with all processors completing at nearly the same time, which is significantly better than current alternative approaches.
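To make the two-phase idea concrete, the following is a minimal simulation sketch, not the HDSS algorithm itself. It assumes independent iterations, a fixed probe block in the adaptive phase, and a completion-phase block equal to half of the remaining iterations scaled by the worker's measured weight (a weighted guided self-scheduling rule chosen here for illustration). The function `simulate` and the parameters `speeds`, `probe`, and `min_chunk` are hypothetical names introduced for this example.

```python
import heapq
import math


def simulate(total_iters, speeds, probe=8, min_chunk=1):
    """Simulate a two-phase weighted self-scheduling of loop iterations.

    speeds: hypothetical true throughput of each worker (iterations per
    time unit). The scheduler never reads it directly; it is only used
    to compute how long a block takes on that worker.
    Returns (finish_times, iterations_done), one entry per worker.
    """
    n = len(speeds)
    remaining = total_iters
    done = [0] * n
    finish = [0.0] * n
    rates = [0.0] * n
    events = []  # min-heap of (time the worker becomes idle, worker id)

    # Adaptive phase (assumed form): give each worker a small probe
    # block and derive its throughput from the measured elapsed time.
    for w in range(n):
        block = min(probe, remaining)
        if block == 0:
            break
        remaining -= block
        elapsed = block / speeds[w]
        rates[w] = block / elapsed
        done[w] += block
        finish[w] = elapsed
        heapq.heappush(events, (elapsed, w))

    total_rate = sum(rates)

    # Completion phase (assumed rule): whichever worker becomes idle
    # first takes a block proportional to its measured weight, so
    # blocks shrink as the loop drains and workers finish together.
    while remaining > 0:
        now, w = heapq.heappop(events)
        weight = rates[w] / total_rate
        block = max(min_chunk, math.ceil(remaining * weight / 2))
        block = min(block, remaining)
        remaining -= block
        done[w] += block
        finish[w] = now + block / speeds[w]
        heapq.heappush(events, (finish[w], w))

    return finish, done


if __name__ == "__main__":
    # One slow worker (a CPU core, say) and one 4x faster accelerator.
    finish, done = simulate(1000, [1.0, 4.0])
    print("iterations per worker:", done)
    print("finish times:", finish)
```

In this toy setup the faster worker ends up with most of the iterations and both workers finish at nearly the same time. The sketch fixes the probe size and ignores per-block launch and transfer overheads, so it does not model the block-size trade-off that the abstract highlights.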