A work-stealing scheduler for X10's task parallelism with suspension

Authors:
Olivier Tardieu; Haichuan Wang; Haibo Lin
Affiliations:
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA; University of Illinois at Urbana-Champaign, Urbana, IL, USA; IBM Research - China, Beijing, China
Venue:
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Year:
2012

Abstract

The X10 programming language is intended to ease the programming of scalable concurrent and distributed applications. X10 augments a familiar imperative object-oriented programming model with constructs to support light-weight asynchronous tasks as well as execution across multiple address spaces. A crucial aspect of X10's runtime system is the scheduling of concurrent tasks. Work-stealing schedulers have been shown to efficiently load balance fine-grain divide-and-conquer task-parallel programs on SMPs and multicores. But X10 is not limited to shared-memory fork-join parallelism. X10 permits tasks to suspend and synchronize by means of conditional atomic blocks and remote task invocations. In this paper, we demonstrate that work-stealing scheduling principles are applicable to a rich programming language such as X10, achieving performance at scale without compromising expressivity, ease of use, or portability. We design and implement a portable work-stealing execution engine for X10. While this engine is biased toward the efficient execution of fork-join parallelism in shared memory, it handles the full X10 language, especially conditional atomic blocks and distribution. We show that this engine improves the run time of a series of benchmark programs by several orders of magnitude when used in combination with the C++ backend compiler and runtime for X10. It achieves scaling comparable to state-of-the-art work-stealing scheduler implementations (the Cilk++ compiler and the Java fork/join framework) despite the dramatic increase in generality.