Effective scheduling techniques for high-level parallel programming languages

  • Authors:
  • John H. Reppy; Michael Alan Rainey

  • Affiliations:
  • The University of Chicago; The University of Chicago

  • Venue:
  • Doctoral dissertation, The University of Chicago
  • Year:
  • 2010

Abstract

In the not-so-distant past, parallel programming was mostly the concern of programmers specializing in high-performance computing. Nowadays, however, many desktop and laptop computers come equipped with a species of shared-memory multiprocessor called a multicore processor, making parallel programming a concern for a much broader range of programmers. High-level parallel languages, such as Parallel ML (PML) and Haskell, seek to reduce the complexity of programming multicore processors by giving programmers abstract execution models, such as implicit threading, in which programmers annotate their programs to suggest the parallel decomposition. Implicitly-threaded programs, however, do not specify the actual decomposition of computations or the mapping from computations to processors. The annotations act simply as hints that can be ignored and safely replaced with sequential counterparts. The parallel decomposition itself is the responsibility of the language implementation and, more specifically, of the scheduling system. Threads can take arbitrarily different amounts of time to execute, and these times are difficult to predict. Implicit threading encourages the programmer to divide the program into threads that are as small as possible, because doing so increases the flexibility of the scheduler in its duty to distribute work evenly across processors. The downside of such fine-grain parallelism is that if the total scheduling cost is too large, then parallelism is not worthwhile. This problem is the focus of this dissertation.

The starting point of this dissertation is work stealing, a scheduling policy well known for its scalable parallel performance, and the work-first principle, which serves as a guide for building efficient implementations of work stealing. In this dissertation, I present two techniques, Lazy Promotion and Lazy Tree Splitting, for implementing work stealing. Both techniques derive their efficiency from adhering to the work-first principle. Lazy Promotion is a strategy that improves the performance, in terms of execution time, of a work-stealing scheduler by reducing the load the scheduler places on the garbage collector. Lazy Tree Splitting is a technique for automatically scheduling the execution of parallel operations over trees to yield scalable performance and eliminate the need for per-application tuning.

I use Manticore, PML's compiler and runtime system, and a sixteen-core NUMA machine as a testbed for these techniques. In addition, I present two empirical studies. In the first study, I evaluate Lazy Promotion over six PML benchmarks. The results demonstrate that Lazy Promotion either outperforms or performs the same as an alternative scheme based on Eager Promotion. This study also evaluates the design of the Manticore runtime system, in particular the split-heap memory manager, by comparing the system to an alternative based on a unified-heap memory manager and showing that the unified version has limited scalability due to poor locality. In the second study, I evaluate Lazy Tree Splitting over seven PML benchmarks by comparing it to its alternative, Eager Tree Splitting. The results show that, although the two techniques offer similar scalability, only Lazy Tree Splitting is suitable for building an effective language implementation.
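To make the "annotations are only hints" point concrete, here is a minimal sketch using Haskell's `par` and `pseq` combinators (module Control.Parallel in the `parallel` package). It is not taken from the dissertation, and PML's own annotations (parallel tuples and parallel arrays) differ syntactically, but the behavior described in the abstract is the same: the annotations may be ignored by the scheduler, and erasing them leaves an equivalent sequential program.

```haskell
-- Illustrative only; par/pseq come from the `parallel` package.
import Control.Parallel (par, pseq)

-- Parallel Fibonacci: `par` merely *suggests* that x may be evaluated in
-- parallel with the rest of the expression; the scheduler is free to ignore
-- the hint, and deleting both annotations yields the sequential function.
pfib :: Int -> Int
pfib n
  | n < 2     = n
  | otherwise = x `par` (y `pseq` (x + y))
  where
    x = pfib (n - 1)
    y = pfib (n - 2)

main :: IO ()
main = print (pfib 30)
```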
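The work-first principle directs implementation effort toward the common case, the owner working through its own task queue, and away from the rare steal path. Below is a minimal sketch, again in Haskell, of the per-worker deque that underlies a work-stealing scheduler; it uses a simplified list-backed representation with atomic updates rather than Manticore's actual data structures, and is meant only to show where the owner and the thieves operate.

```haskell
import Data.IORef

-- One deque per worker.  The owner pushes and pops at the "bottom" (the
-- front of the list), so the hot path touches the most recently created,
-- finest-grained tasks; thieves steal from the "top" (the back), taking
-- the oldest and typically largest task.
newtype Deque a = Deque (IORef [a])

newDeque :: IO (Deque a)
newDeque = Deque <$> newIORef []

-- Owner operations: kept as cheap as possible (work-first principle).
pushBottom :: Deque a -> a -> IO ()
pushBottom (Deque r) t = atomicModifyIORef' r (\ts -> (t : ts, ()))

popBottom :: Deque a -> IO (Maybe a)
popBottom (Deque r) = atomicModifyIORef' r $ \ts ->
  case ts of
    []      -> ([], Nothing)
    t : ts' -> (ts', Just t)

-- Thief operation: may be slower, because steals are assumed to be rare.
steal :: Deque a -> IO (Maybe a)
steal (Deque r) = atomicModifyIORef' r $ \ts ->
  case ts of
    [] -> ([], Nothing)
    _  -> (init ts, Just (last ts))
```

An idle worker repeatedly calls `steal` on a randomly chosen victim's deque; the owner only ever touches its own deque through `pushBottom` and `popBottom`, which is where the work-first principle concentrates the optimization effort.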
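Lazy Tree Splitting avoids both eager forking at every node and a hand-tuned sequential cutoff: a worker traverses its tree sequentially and splits off untraversed work only when it observes that some other processor is idle. The sketch below conveys only that control structure; `hungry` and `handOff` are hypothetical placeholders for the runtime's idleness signal and steal mechanism, and the dissertation's actual formulation operates over ropes via tree zippers rather than an explicit list of pending subtrees.

```haskell
import Data.IORef

data Tree a = Leaf a | Node (Tree a) (Tree a)

-- Sum a tree's leaves, splitting lazily.  `hungry` is a hypothetical flag
-- set when another worker runs out of work; `handOff` is a hypothetical
-- operation that ships a subtree to an idle worker and returns an action
-- that waits for (joins) its result.
treeSum :: IORef Bool -> (Tree Int -> IO (IO Int)) -> Tree Int -> IO Int
treeSum hungry handOff t0 = go 0 [t0]
  where
    -- The list holds the subtrees not yet traversed.
    go acc []            = pure acc
    go acc (t : pending) = do
      idle <- readIORef hungry
      if idle && not (null pending)
        then do
          -- Someone is idle: donate half of the *pending* work instead of
          -- forking a task at every node.
          let (mine, theirs) = splitAt (length pending `div` 2) pending
          joinStolen <- handOff (foldr1 Node theirs)
          localSum   <- go acc (t : mine)
          stolenSum  <- joinStolen
          pure (localSum + stolenSum)
        else case t of
          Leaf x   -> go (acc + x) pending
          Node l r -> go acc (l : r : pending)
```

When no processor is idle, the traversal runs as a tight sequential loop, which is what the work-first principle asks for: the cost of splitting is paid only when parallelism is actually needed, so no per-application grain size has to be tuned.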