Load balancing on speed

Authors:
Steven Hofmeyr; Costin Iancu; Filip Blagojević
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Venue:
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Year:
2010

Abstract

To fully exploit multicore processors, applications are expected to provide a large degree of thread-level parallelism. While adequate for low core counts and their typical workloads, the current load balancing support in operating systems may not be able to achieve efficient hardware utilization for parallel workloads. Balancing run queue length globally ignores the needs of parallel applications, where threads are required to make equal progress. In this paper we present a load balancing technique designed specifically for parallel applications running on multicore systems. Instead of balancing run queue length, our algorithm balances the time a thread has executed on "faster" and "slower" cores. We provide a user-level implementation of speed balancing on UMA and NUMA multi-socket architectures running Linux and discuss its behavior across a variety of workloads, usage scenarios and programming models. Our results indicate that speed balancing, when compared to the native Linux load balancing, improves performance and provides good performance isolation in all cases considered. Speed balancing is also able to provide comparable or better performance than DWRR, a fair multi-processor scheduling implementation inside the Linux kernel. Furthermore, parallel application performance is often determined by the implementation of synchronization operations, and speed balancing alleviates the need for tuning the implementations of such primitives.
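The contrast the abstract draws between balancing run queue length and balancing the time threads spend on faster and slower cores can be illustrated with a toy simulation. The sketch below is not the authors' implementation: the function name `simulate`, the even time-slicing model, and the swap policy are all assumptions made for illustration. It assumes that a core running n threads gives each of them 1/n of a time unit, so a less loaded core is a "faster" one, and that the application is gated by its slowest thread.

```python
def simulate(num_threads, num_cores, steps, rebalance):
    """Return per-thread progress after `steps` time units (toy model)."""
    # Static initial placement: thread i starts on core i % num_cores.
    placement = [i % num_cores for i in range(num_threads)]
    progress = [0.0] * num_threads
    for _ in range(steps):
        # Assumption: each core splits one time unit evenly among its threads,
        # so a core with fewer threads is "faster" for each of them.
        load = [placement.count(c) for c in range(num_cores)]
        for t, c in enumerate(placement):
            progress[t] += 1.0 / load[c]
        if rebalance:
            # Hypothetical policy: swap the thread that is furthest behind
            # with the thread that is furthest ahead, if the laggard sits
            # on the more heavily loaded (slower) core.
            lag = min(range(num_threads), key=progress.__getitem__)
            lead = max(range(num_threads), key=progress.__getitem__)
            if load[placement[lag]] > load[placement[lead]]:
                placement[lag], placement[lead] = placement[lead], placement[lag]
    return progress


if __name__ == "__main__":
    # Three threads on two cores: queue lengths (2, 1) cannot be evened out.
    print("static placement:", simulate(3, 2, 300, rebalance=False))
    print("rotating threads:", simulate(3, 2, 300, rebalance=True))
```

With three threads on two cores, the queue lengths are already as even as they can be, yet under static placement two threads advance at half the pace of the third (150 versus 300 units of progress after 300 steps). Rotating threads between the slower and faster core brings every thread to 200 units, which is what matters when all threads must reach a barrier before the application can proceed.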